In-band full-duplex systems can transmit and receive information simultaneously on the same frequency band. However, due to the strong self-interference caused by the transmitter to its own receiver, the use of non-linear digital self-interference cancellation is essential. In this work, we describe a hardware architecture for a neural network-based non-linear self-interference (SI) canceller and we compare it with our own hardware implementation of a conventional polynomial-based SI canceller. In particular, we present implementation results for a shallow and a deep neural network SI canceller as well as for a polynomial SI canceller. Our results show that the deep neural network canceller achieves a hardware efficiency of up to 312.8 Msamples/s/mm² and an energy efficiency of up to 0.9 nJ/sample, which is 2.1× and 2× better than the polynomial SI canceller, respectively. These results show that NN-based methods applied to communications are not only useful from a performance perspective, but can also be a very effective means to reduce the implementation complexity.
I. INTRODUCTION
In-band full-duplex (FD) communications have long been considered impractical due to the strong self-interference (SI) caused by the transmitter to its own receiver. However, recent work on the topic (e.g., [2], [3], [4]) has demonstrated that it is, in fact, possible to achieve sufficient SI cancellation to make FD systems viable. Typically, SI cancellation is performed in both the radio frequency (RF) domain and the digital domain to cancel the SI signal down to the level of the receiver noise floor. There are several ways to achieve RF cancellation that can be broadly categorized into passive RF cancellation and active RF cancellation. Some form of RF cancellation is generally necessary to avoid saturating the analog front-end of the receiver. Passive RF cancellation can be obtained by using, e.g., circulators, directional antennas, beamforming, polarization, or shielding. Active RF cancellation is commonly implemented by transforming the transmitted RF signal appropriately to emulate the SI channel using analog components and subtracting the resulting SI cancellation signal from the received SI signal [2], [4]. Alternatively, an additional transmitter can be used to generate the SI cancellation signal from the transmitted baseband samples [3].
Y. Kurzo is with ON Semiconductor, 2074 Marin-Epagnier, Switzerland (e-mail: yann.kurzo@gmail.com).
A. Kristensen and A. Burg are with the Telecommunications Circuits Laboratory, École polytechnique fédérale de Lausanne, 1015 Lausanne, Switzerland (e-mail: {andreas.kristensen,andreas.burg}@epfl.ch).
A. Balatsoukas-Stimming is with the Eindhoven University of Technology, 5600 MB Eindhoven, The Netherlands (e-mail: a.k.balatsoukas.stimming@tue.nl).
Parts of this work were presented at the 2018 Asilomar Conference on Signals, Systems, and Computers [1]. This work was supported by the Swiss National Science Foundation under project #200021 182621.
However, a residual SI signal is typically still present at the receiver after RF cancellation has been performed. This residual SI signal can, in principle, be easily canceled in the digital domain, since it is caused by a known transmitted signal. Unfortunately, in practice, several transceiver non-linearities distort the SI signal. Some examples of non-linearities include baseband non-linearities (e.g., digital-to-analog converter (DAC) and analog-to-digital converter (ADC)) [5] , IQ imbalance [5] , [6] , phase-noise [7] , [8] , and power amplifier (PA) non-linearities [5] , [6] , [9] , [10] . These effects need to be taken into account using intricate polynomial models to cancel the SI to the level of the receiver noise floor. These polynomial models perform well in practice, but their implementation complexity grows rapidly with the maximum considered nonlinearity order. Principal component analysis (PCA) is an effective complexity reduction technique that can identify the most significant non-linearity terms in a parallel Hammerstein model [10] . However, with PCA-based methods, the transmitted digital baseband samples need to be multiplied with a transformation matrix to generate the SI cancellation signal, thus introducing additional complexity. Moreover, whenever the SI channel changes, the high-complexity PCA operation needs to be re-run. To the best of our knowledge, no hardware implementation of a polynomial SI canceller has been reported in the open literature to date. Only the work of [11] has made a step in this direction, since the authors considered quantization aspects of polynomial SI cancellers.
In the past few years, there has been renewed interest in the use of neural networks (NNs) to augment or replace a range of signal processing tasks in communications systems [12], [13], [14], [15], [16], [17]. NNs are particularly well-suited to tackle non-linear signal processing problems, where traditional model-based algorithms are unavailable or too complex for analytical treatment. However, NN-based solutions can also be used in cases where traditional model-based algorithms suffer from prohibitively high implementation complexity. For example, NNs have been used successfully to perform digital predistortion (DPD) in wireless systems [18], [19], non-linear leakage cancellation in FDD transceivers [20], as well as optical fiber non-linearity compensation [21]. NNs have also been used for non-linear SI cancellation in full-duplex communications [22], [23], [24], and it was shown in [22] that they can achieve SI cancellation performance similar to that of a state-of-the-art polynomial SI cancellation model, but with much lower computational complexity.
Existing NN hardware accelerators, such as [25], [26], mainly target applications where both the size of the NN and the number of inputs are very large, and where producing a few tens of outputs per second is sufficient. Communications applications, on the other hand, use relatively small NNs with few inputs, but need to provide millions of outputs per second. As such, communications applications generally require different and more specialized NN hardware accelerator architectures. However, to date, only a small number of works have considered these hardware-related issues in the context of communications applications. Specifically, the works of [27], [28] study NN quantization as a first step towards hardware implementation, while the authors of [1], [29] describe actual hardware implementations of simple NNs for SI cancellation in full-duplex communications and DPD, respectively.
Contribution: In this work, we present a hardware implementation of the SI cancellation method proposed in [22] to quantify and translate the computational complexity gains over the state-of-the-art polynomial-based model of [10] into real-world hardware resource utilization gains. Moreover, we also implement an instance of the deep NN canceller proposed in [24], which leads to a significant additional complexity reduction compared with the shallow NN canceller of [22]. Since, to the best of our knowledge, no polynomial SI canceller implementations have been reported in the literature, we also present a hardware architecture for a reference polynomial SI canceller. We note that this hardware architecture can also be used for other related applications such as digital predistortion and leakage cancellation in FDD transceivers. We provide FPGA and ASIC implementation results that clearly demonstrate the significant gains with respect to the polynomial SI canceller that can be achieved by both the shallow and the deep NN-based SI cancellers in terms of resource utilization, throughput, and energy efficiency.
Outline: The remainder of this paper is organized as follows. Section II provides background on full-duplex communications and digital SI cancellation using polynomial cancellers, while Section III describes how SI cancellation can be achieved using NNs. In Section IV we describe our proposed NN-based SI canceller hardware architecture and in Section V we describe our proposed baseline polynomial-based SI canceller hardware architecture. In Section VI, we compare the performance and the complexity of a conventional polynomial SI canceller with the NN-based SI cancellers. In Section VI we also provide FPGA and ASIC implementation results. Finally, Section VII concludes this paper.
II. CONVENTIONAL DIGITAL SELF-INTERFERENCE CANCELLATION
Fig. 1 shows a block diagram of a full-duplex transceiver. On the transmitter side, the digital baseband samples $x[n] \in \mathbb{C}$, where $n$ is the sample index, are converted to an analog signal using a digital-to-analog converter (DAC), up-converted to a carrier frequency $f_c$ using an IQ mixer, amplified using a power amplifier (PA), and filtered using a bandpass (BP) filter. The transmitted signal leaks to the receiver through an SI channel $h_{\mathrm{SI}}$ and is then filtered using a BP filter, amplified using a low-noise amplifier (LNA), downconverted using an IQ mixer, and digitized using an analog-to-digital converter (ADC). The SI channel $h_{\mathrm{SI}}$ also models the passive RF SI cancellation. An RF cancellation signal is subtracted from the received SI signal at some point before the LNA to avoid saturating the receiver. Since the transmitter and the receiver are co-located, they share a common local oscillator (LO) signal in order to minimize the effect of phase noise on the SI signal.
If we assume, for simplicity of exposition, that there is no signal-of-interest from a remote node and no thermal noise, then the received signal $y[n]$ in Fig. 1 consists only of the residual SI signal after RF SI cancellation has been performed. We denote the received signal in this special case by $y_{\mathrm{SI}}[n]$. The goal of digital SI cancellation is to reproduce an accurate copy of $y_{\mathrm{SI}}[n]$, denoted by $\hat{y}_{\mathrm{SI}}[n]$, based on samples of the transmitted baseband signal $x[n]$. This signal is then subtracted from $y[n]$ so that the residual SI signal is $y_{\mathrm{SI}}[n] - \hat{y}_{\mathrm{SI}}[n]$. If $y_{\mathrm{SI}}[n]$ is reconstructed perfectly, then the SI can be canceled entirely and $y_{\mathrm{SI}}[n] - \hat{y}_{\mathrm{SI}}[n] = 0$. In practice, as discussed previously, due to the presence of thermal noise and transceiver non-linearities, perfect SI cancellation is difficult to achieve. The SI cancellation performance $C_{\mathrm{dB}}$ is typically evaluated as:
$$C_{\mathrm{dB}} = 10 \log_{10}\left(\frac{\sum_{n} \left|y_{\mathrm{SI}}[n]\right|^2}{\sum_{n} \left|y_{\mathrm{SI}}[n] - \hat{y}_{\mathrm{SI}}[n]\right|^2}\right). \tag{1}$$
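As an illustration, the cancellation metric can be evaluated with a few lines of Python. This is a minimal sketch that assumes the metric is the ratio of the SI power before and after subtracting the estimate, expressed in dB; the function name `cancellation_db` is hypothetical.

```python
import math


def cancellation_db(y_si, y_si_hat):
    """Ratio (in dB) of the SI power before and after subtracting the estimate."""
    if len(y_si) != len(y_si_hat):
        raise ValueError("signals must have equal length")
    power_before = sum(abs(y) ** 2 for y in y_si)
    power_after = sum(abs(y - y_hat) ** 2 for y, y_hat in zip(y_si, y_si_hat))
    if power_after == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(power_before / power_after)


# Hypothetical example: an estimate that leaves 10% of the amplitude uncancelled.
y_example = [1 + 1j, 2 - 1j, -0.5 + 0.25j]
y_hat_example = [0.9 * v for v in y_example]
c_db_example = cancellation_db(y_example, y_hat_example)
```

A residual of 10% in amplitude corresponds to 1% in power, so the example yields a cancellation of approximately 20 dB.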
A. Linear Self-Interference Cancellation
Linear SI cancellation is the simplest form of SI cancellation that ignores all non-linear effects of the various components in Fig. 1. The linear SI cancellation signal is constructed as:
$$\hat{y}_{\mathrm{SI}}[n] = \sum_{l=0}^{L-1} \hat{h}[l]\, x[n-l], \tag{2}$$
where $\hat{h}[l] \in \mathbb{C}$, $l \in \{0, \ldots, L-1\}$, models the SI channel $h_{\mathrm{SI}}$ and any other memory effect in the transceiver chain. The parameters $\hat{h}[l]$ can be obtained from training samples either in a one-shot fashion using standard least-squares (LS) estimation or adaptively using an iterative version of the LS estimation algorithm, such as least mean squares (LMS) or recursive least squares (RLS).
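The adaptive option can be sketched as follows. The snippet uses a normalized LMS update, which is one common LMS variant chosen here for illustration and not necessarily the one used by the authors; the step size, the zero-padding before the first sample, and the synthetic three-tap channel are all assumptions.

```python
import random


def nlms_linear_canceller(x, y_si, num_taps, step=0.5, eps=1e-12):
    """Adaptively estimate h[0..L-1] so that sum_l h[l] * x[n-l] tracks y_si[n]."""
    h = [0j] * num_taps
    y_hat = []
    for n in range(len(x)):
        # Regressor [x[n], x[n-1], ..., x[n-L+1]], zero-padded before the first sample.
        u = [x[n - l] if n - l >= 0 else 0j for l in range(num_taps)]
        est = sum(h_l * u_l for h_l, u_l in zip(h, u))
        err = y_si[n] - est
        norm = sum(abs(u_l) ** 2 for u_l in u) + eps
        # Normalized LMS update: h <- h + step * err * conj(u) / ||u||^2.
        h = [h_l + step * err * u_l.conjugate() / norm for h_l, u_l in zip(h, u)]
        y_hat.append(est)
    return h, y_hat


# Hypothetical noiseless test signal: a known 3-tap channel driven by random samples.
rng = random.Random(0)
x = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(2000)]
h_true = [0.5 + 0.2j, -0.1j, 0.05 + 0j]
y_si = [sum(h_true[l] * x[n - l] for l in range(3) if n - l >= 0) for n in range(len(x))]
h_est, _ = nlms_linear_canceller(x, y_si, num_taps=3)
```

In this noiseless setting the estimated taps converge to the true channel; with thermal noise and non-linear distortion present, the estimate would only be approximate.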
B. Polynomial Non-Linear Self-Interference Cancellation
Each active component in the transceiver model shown in Fig. 1 is generally a dynamic non-linear system. This means that linear cancellation alone is, in most cases, not accurate enough to cancel a sufficiently large fraction of the SI signal. It has been shown that the transmitter IQ imbalance and the PA non-linearities typically dominate all remaining non-linearities [9], [10]. This is true in particular when the transmitter and receiver chains use the same local oscillator signal for upconversion and downconversion, as shown in Fig. 1, since the effect of phase noise then becomes negligible. As such, the SI cancellation signal $\hat{y}_{\mathrm{SI}}[n]$ can be constructed as [9], [10]:
$$\hat{y}_{\mathrm{SI}}[n] = \sum_{\substack{p=1,\\ p\ \mathrm{odd}}}^{P} \sum_{l=0}^{L-1} \hat{h}_p[l] \left(K_1 x[n-l] + K_2 x^*[n-l]\right) \left|K_1 x[n-l] + K_2 x^*[n-l]\right|^{p-1}, \tag{3}$$
where the parameters $\hat{h}_p[l] \in \mathbb{C}$, $l \in \{0, \ldots, L-1\}$, $p \in \{1, 3, \ldots, P\}$, model the joint effect of $h_{\mathrm{SI}}[l]$ and the memory effects introduced by the PA for the harmonic of order $p$, and $K_1$ and $K_2$ are parameters that model the IQ imbalance.
Only odd values for $p$ are considered because even harmonics typically lie out-of-band and are filtered out by the transmitter and receiver BP filters. With some arithmetic manipulations, $\hat{y}_{\mathrm{SI}}[n]$ can be re-written as [9], [10]:
$$\hat{y}_{\mathrm{SI}}[n] = \sum_{\substack{p=1,\\ p\ \mathrm{odd}}}^{P} \sum_{q=0}^{p} \sum_{l=0}^{L-1} \hat{h}_{p,q}[l]\, x[n-l]^{q} \left(x^*[n-l]\right)^{p-q}, \tag{4}$$
where the parameters $\hat{h}_{p,q}[l] \in \mathbb{C}$ capture the joint effect of $\hat{h}_p[l]$ and of the IQ imbalance parameters $K_1$ and $K_2$. The model in (4) is linear with respect to the parameters $\hat{h}_{p,q}[l]$, and therefore, similarly to linear SI estimation, the parameters $\hat{h}_{p,q}[l]$ can be estimated based on training samples using some variant of the LS estimation algorithm. The basis functions of the polynomial model in (4) are defined as:
$$\mathrm{BF}_{p,q}(x[n]) = x[n]^{q} \left(x^*[n]\right)^{p-q}. \tag{5}$$
The number of distinct basis functions in (4) is [10]:
$$N_{\mathrm{BF}} = \frac{L}{4}(P+1)(P+3). \tag{6}$$
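Using (5), the expression for $\hat{y}_{\mathrm{SI}}[n]$ in (4) can be re-written in a more compact form:
$$\hat{y}_{\mathrm{SI}}[n] = \sum_{\substack{p=1,\\ p\ \mathrm{odd}}}^{P} \sum_{q=0}^{p} \sum_{l=0}^{L-1} \hat{h}_{p,q}[l]\, \mathrm{BF}_{p,q}(x[n-l]). \tag{7}$$
We note that linear cancellation is a special case of the polynomial model in (7) when only considering the single term for $p = 1$ and $q = 1$.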
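To make the structure of the polynomial model concrete, the following sketch evaluates the triple sum directly. It assumes that the basis function for a pair (p, q) is the q-th power of the sample multiplied by the (p − q)-th power of its conjugate, which is consistent with the two properties stated in Proposition 1. The dictionary-based parameter layout and all function names are hypothetical.

```python
def basis_function(x, p, q):
    """BF_{p,q}(x) = x^q * conj(x)^(p - q)."""
    return x ** q * x.conjugate() ** (p - q)


def count_pq_pairs(max_order):
    """Number of (p, q) pairs with p odd, p <= P, and 0 <= q <= p."""
    return sum(p + 1 for p in range(1, max_order + 1, 2))


def polynomial_canceller_output(x, h, n, num_taps, max_order):
    """Evaluate the polynomial canceller output for sample index n.

    h maps (p, q, l) to a complex parameter; missing keys are treated as zero.
    """
    total = 0j
    for p in range(1, max_order + 1, 2):
        for q in range(p + 1):
            for l in range(num_taps):
                if n - l >= 0:
                    total += h.get((p, q, l), 0j) * basis_function(x[n - l], p, q)
    return total


# Linear special case: only the (p, q) = (1, 1) terms are non-zero.
x_demo = [1 + 1j, 2 + 0j]
h_demo = {(1, 1, 0): 2 + 0j, (1, 1, 1): 1j}
y_demo = polynomial_canceller_output(x_demo, h_demo, n=1, num_taps=2, max_order=1)
```

The number of (p, q) pairs obtained by enumeration equals (P + 1)(P + 3)/4, so a model with L taps has L(P + 1)(P + 3)/4 complex parameters.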
C. Computational Complexity
Assuming that each complex multiplication is implemented using three real-valued multiplications and five real-valued additions and that each complex addition is implemented using two real-valued additions, it can directly be deduced from (2) that the total number of real-valued multiplications and additions that are required by the linear SI canceller is:
$$N_{\mathrm{ADD,linear}} = 5L + 2(L-1) = 7L - 2, \tag{8}$$
$$N_{\mathrm{MUL,linear}} = 3L. \tag{9}$$
Moreover, if we ignore the computation of the basis functions for simplicity,¹ the total number of real-valued multiplications and additions that are required by the polynomial SI canceller (which also includes the linear cancellation term) is [22]:
$$N_{\mathrm{ADD,poly}} = \frac{7}{4} L (P+1)(P+3) - 2, \tag{10}$$
$$N_{\mathrm{MUL,poly}} = \frac{3}{4} L (P+1)(P+3). \tag{11}$$
We note that the expression for N ADD,poly in our previous work of [22] erroneously ignored the five real-valued additions that are required to implement each complex multiplication. As such, the actual complexity of the polynomial canceller is even higher than that reported in [22] .
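The operation counts follow mechanically from the stated cost model (three real multiplications and five real additions per complex multiplication, two real additions per complex addition). The sketch below applies this cost model to a weighted sum with a given number of complex terms; the function names are hypothetical and the basis function computation is ignored, as in the text.

```python
def weighted_sum_ops(num_terms):
    """Real-valued (multiplications, additions) for a sum of num_terms complex products."""
    mults = 3 * num_terms
    adds = 5 * num_terms + 2 * (num_terms - 1)
    return mults, adds


def linear_ops(num_taps):
    """Linear canceller: one complex product per tap."""
    return weighted_sum_ops(num_taps)


def polynomial_ops(num_taps, max_order):
    """Polynomial canceller: one complex product per (p, q, l) term."""
    num_terms = num_taps * (max_order + 1) * (max_order + 3) // 4
    return weighted_sum_ops(num_terms)
```

For example, under these assumptions a polynomial canceller with L = 13 and P = 7 has 260 complex terms, i.e., 780 real multiplications and 1818 real additions, compared with 39 and 89 for the linear canceller alone.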
III. NEURAL NETWORK NON-LINEAR DIGITAL SELF-INTERFERENCE CANCELLATION
Polynomial SI cancellation models such as (7) work well in practice but are often highly redundant in the sense that many of the $\hat{h}_{p,q}[l]$ parameters are very close to zero. NN-based SI cancellers, on the other hand, can extract the essence of the non-linear structure of the SI signal from training data, which often significantly reduces the complexity of the SI cancellation model [22]. A challenge when using NN cancellers is that the NN training process is inherently noisy due to the use of mini-batches for gradient estimation, which makes it difficult to achieve a very accurate reconstruction of the SI signal [24]. To overcome this problem, [22] used a NN to reconstruct only a particular part of the SI signal, while using conventional linear cancellation for the remainder of the SI. Specifically, in [22] the SI signal was conceptually decomposed into a linear component and a non-linear component:
$$y_{\mathrm{SI}}[n] = y_{\mathrm{SI,linear}}[n] + y_{\mathrm{SI,nl}}[n]. \tag{12}$$
The SI cancellation is carried out in two steps. First, linear cancellation is used in order to reconstruct $\hat{y}_{\mathrm{SI,linear}}[n]$ as:
$$\hat{y}_{\mathrm{SI,linear}}[n] = \sum_{l=0}^{L-1} \hat{h}[l]\, x[n-l]. \tag{13}$$
The parameters $\hat{h}[l]$ are obtained using LS estimation while considering the substantially weaker signal $\hat{y}_{\mathrm{SI,nl}}[n]$ as noise.
The linear SI cancellation signal is then subtracted from the SI signal in order to obtain:
$$y_{\mathrm{SI,nl}}[n] \approx y_{\mathrm{SI}}[n] - \hat{y}_{\mathrm{SI,linear}}[n]. \tag{14}$$
The task of the NN is limited to reconstructing $y_{\mathrm{SI,nl}}[n]$ based on the appropriate $x[n]$ samples.
As is common practice when training NNs, we normalize the input and output training samples so that x[n] and y SI,nl [n] have unit variance (i.e., the variance of the real part and the variance of the imaginary part are both equal to 0.5) and zero mean. To perform SI cancellation on the test data, the output of the NN is denormalized using the mean and variance estimated based on the training data.
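The normalization and denormalization steps can be sketched as below. The snippet normalizes complex samples to zero mean and unit total variance; the real and imaginary parts each have variance 0.5 only if they contribute equally, which is assumed here rather than enforced. The function names are hypothetical.

```python
import math


def fit_normalizer(samples):
    """Estimate the mean and standard deviation of complex training samples."""
    count = len(samples)
    mean = sum(samples) / count
    variance = sum(abs(s - mean) ** 2 for s in samples) / count
    return mean, math.sqrt(variance)


def normalize(samples, mean, std):
    """Map samples to zero mean and unit variance."""
    return [(s - mean) / std for s in samples]


def denormalize(samples, mean, std):
    """Undo the normalization using statistics estimated on the training data."""
    return [s * std + mean for s in samples]


# Hypothetical training samples.
train = [1 + 2j, 3 - 1j, -2 + 0.5j, 0.5j]
mean_est, std_est = fit_normalizer(train)
train_norm = normalize(train, mean_est, std_est)
```

At inference time, `denormalize` would be applied to the NN output with the statistics of the training targets, not those of the test data.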
A. Neural Network Structure
Due to the universal approximation theorem [30], a feedforward NN with one hidden layer, as depicted in Fig. 2, is sufficient to reconstruct the non-linear SI signal. While the work of [22] only considered feedforward NNs with one hidden layer, it is possible to use any NN architecture to generate $\hat{y}_{\mathrm{SI,nl}}[n]$. In particular, [24] employed a deep feedforward NN and showed that using many layers with few neurons per layer has significant computational complexity advantages with respect to a shallow NN SI canceller that uses a single layer with more neurons. In all cases and as shown in Fig. 2, the cancellation NNs have $2L$ input nodes, which correspond to the real and imaginary parts of the $L$ delayed versions of $x[n]$, and two output nodes, which correspond to the real and imaginary parts of the target $\hat{y}_{\mathrm{SI,nl}}[n]$ sample. In the following, we denote the number of hidden layers by $N_l$ and the number of hidden nodes per layer by $N_h$. Let the vector $\mathbf{l}_0$ contain the $2L$ inputs to the NN:
$$\mathbf{l}_0 = \left[\Re\{x[n]\} \;\cdots\; \Re\{x[n-L+1]\} \;\; \Im\{x[n]\} \;\cdots\; \Im\{x[n-L+1]\}\right]^{T}. \tag{15}$$
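The outputs of the first hidden layer neurons are given by:
$$\mathbf{l}_1 = f_1\left(\mathbf{W}_1 \mathbf{l}_0 + \mathbf{b}_1\right), \tag{16}$$
where $\mathbf{W}_1$ is an $N_h \times 2L$ matrix containing the hidden layer weights, $\mathbf{b}_1$ is an $N_h \times 1$ vector containing the hidden layer biases, and $f_1(\cdot)$ is the (vectorized) non-linear activation function used in the first hidden layer. The outputs of the neurons in the hidden layers $1 < l \le N_l$ are:
$$\mathbf{l}_l = f_l\left(\mathbf{W}_l \mathbf{l}_{l-1} + \mathbf{b}_l\right), \tag{17}$$
where $\mathbf{W}_l$ is an $N_h \times N_h$ matrix containing the hidden layer weights, $\mathbf{b}_l$ is an $N_h \times 1$ vector containing the hidden layer biases, and $f_l(\cdot)$ is the (vectorized) non-linear activation function used in hidden layer $l$. Finally, the outputs of the output layer neurons are given by:
$$\mathbf{l}_{N_l+1} = f_{N_l+1}\left(\mathbf{W}_{N_l+1} \mathbf{l}_{N_l} + \mathbf{b}_{N_l+1}\right), \tag{18}$$
where $\mathbf{W}_{N_l+1}$ is a $2 \times N_h$ matrix containing the output layer weights, $\mathbf{b}_{N_l+1}$ is a $2 \times 1$ vector containing the output layer biases, and $f_{N_l+1}$ is the activation function used in the output layer. As can be seen in Fig. 2, for $\mathbf{l}_{N_l+1}$ we have:
$$\mathbf{l}_{N_l+1} = \left[\Re\{\hat{y}_{\mathrm{SI,nl}}[n]\} \;\; \Im\{\hat{y}_{\mathrm{SI,nl}}[n]\}\right]^{T}. \tag{19}$$
The goal of the NN is to minimize the mean squared error between the expected NN output and the actual NN output:
$$\mathrm{MSE} = \frac{1}{N} \sum_{n=0}^{N-1} \left|y_{\mathrm{SI,nl}}[n] - \hat{y}_{\mathrm{SI,nl}}[n]\right|^2, \tag{20}$$
where $N$ is the total number of training samples. The MSE in (20) is minimized by choosing appropriate values for $\mathbf{W}_l$, $\mathbf{b}_l$, $l \in \{1, \ldots, N_l + 1\}$, using back-propagation [31].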
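The forward pass of such a canceller NN can be sketched in plain Python. The sketch assumes ReLU activations in the hidden layers and a linear output layer, and it orders the inputs as all real parts followed by all imaginary parts; the ordering, the zero-padding before the first sample, and all names are assumptions made for illustration.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def affine(weights, biases, inputs):
    """Compute W @ inputs + b for a list-of-rows weight matrix."""
    return [sum(w * a for w, a in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def build_input(x, n, num_taps):
    """Stack real parts, then imaginary parts, of x[n], ..., x[n-L+1]."""
    taps = [x[n - l] if n - l >= 0 else 0j for l in range(num_taps)]
    return [t.real for t in taps] + [t.imag for t in taps]


def nn_forward(layers, l0):
    """layers is a list of (W, b) pairs; the last pair is the linear output layer."""
    activations = l0
    for weights, biases in layers[:-1]:
        activations = relu(affine(weights, biases, activations))
    weights, biases = layers[-1]
    out = affine(weights, biases, activations)
    return complex(out[0], out[1])  # real and imaginary output nodes


# Tiny hypothetical network: L = 1 (two inputs), N_h = 2, N_l = 1.
demo_layers = [
    ([[1.0, 0.0], [0.0, -1.0]], [0.0, 0.5]),
    ([[1.0, 1.0], [1.0, -1.0]], [0.0, 1.0]),
]
demo_output = nn_forward(demo_layers, build_input([2 - 3j], n=0, num_taps=1))
```

Deeper networks are obtained by adding further (W, b) pairs before the output layer; training by back-propagation is not shown.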
B. Computational Complexity
Let us assume that the NN uses the popular ReLU activation function in the hidden layers (i.e., $f_l(x) = \mathrm{ReLU}(x) = \max(0, x)$, $l \in \{1, \ldots, N_l\}$), which has similar complexity to a real-valued addition, and a linear activation function in the output layer (i.e., $f_{N_l+1}(x) = x$). Then, the number of real-valued multiplications and additions that are required by a NN canceller with a single hidden layer and $N_h$ hidden neurons is [22]:
where the second term in both expressions comes from the linear SI canceller that is required for the NN SI canceller to work. Moreover, two additions are required to add the output of the linear SI canceller with the output of the NN canceller.² For the more general NN described in [24] with $N_l$ hidden layers with $N_h$ neurons each, (21)-(22) can be generalized to:
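A possible way to count the operations of the NN canceller is sketched below. The counting rules are assumptions made for this illustration: each neuron with k inputs costs k multiplications and k additions (including the bias), each ReLU costs one addition, the linear canceller costs 3L multiplications and 7L − 2 additions, and two additions combine the two canceller outputs. The resulting numbers may therefore differ in detail from the expressions in [22] and [24].

```python
def nn_canceller_ops(num_taps, hidden_nodes, hidden_layers=1):
    """Real-valued (multiplications, additions) of the NN canceller under the stated rules."""
    n_in = 2 * num_taps
    # First hidden layer: N_h neurons with 2L inputs each, plus one ReLU per neuron.
    mults = n_in * hidden_nodes
    adds = (n_in + 1) * hidden_nodes
    # Remaining hidden layers: N_h neurons with N_h inputs each, plus one ReLU per neuron.
    mults += (hidden_layers - 1) * hidden_nodes * hidden_nodes
    adds += (hidden_layers - 1) * (hidden_nodes + 1) * hidden_nodes
    # Output layer: two neurons with N_h inputs each and a linear activation.
    mults += 2 * hidden_nodes
    adds += 2 * hidden_nodes
    # Linear canceller running in parallel.
    mults += 3 * num_taps
    adds += 7 * num_taps - 2
    # Combining the linear and non-linear outputs (real and imaginary parts).
    adds += 2
    return mults, adds
```

For instance, `nn_canceller_ops(13, 17)` evaluates a hypothetical shallow network with L = 13 and 17 hidden neurons, and `nn_canceller_ops(13, 10, hidden_layers=2)` a hypothetical deeper one; these sizes are illustrative and not taken from the results section.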
IV. NEURAL NETWORK CANCELLER HARDWARE ARCHITECTURE
In this section, we describe a generic hardware architecture that can be used to implement both the shallow NN-based SI canceller of [22] and deeper NN-based SI cancellers such as the ones described in [24] . We first provide an overview of the architecture, which is followed by a more detailed explanation of each component. In Fig. 3 , we show the highlevel architecture of a general NN-based canceller. The set of baseband samples {x[n], . . . , x[n − L + 1]} is given as an input to a linear SI canceller and a NN-based SI canceller. These two SI cancellers operate in parallel in order to generate the linear and non-linear cancellation signals, respectively, which are then added (after the denormalization step for the NN) to produce the cancellation signalŷ SI [n].
A. Macro-Pipeline Architecture
As shown in the example of Fig. 4 , in our architecture, the canceller NN layers are mapped to macro-pipeline stages. Each macro-pipeline stage requires several clock cycles to compute its outputs and it can start its computations as soon as valid outputs from the previous macro-pipeline stage become available. Due to the high throughput requirements of the SI cancellation task, we instantiate one macro-pipeline stage for each layer in the NN that is used for cancellation.
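Let $NE_l$ denote the number of neurons in layer $l$. We note that $NE_0 = 2L$, $NE_{N_l+1} = 2$, and $NE_l = N_h$ for all hidden layers $l \in \{1, \ldots, N_l\}$. The goal of a macro-pipeline stage is to compute $\mathbf{l}_l$ using expressions of the form (16)-(18). Each element $j \in \{0, \ldots, NE_l - 1\}$ of $\mathbf{l}_l$ can be computed as:
$$\mathbf{l}_l[j] = f_l\left(\sum_{i=0}^{NE_{l-1}-1} \mathbf{W}_l[j,i]\, \mathbf{l}_{l-1}[i] + \mathbf{b}_l[j]\right). \tag{25}$$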
The architecture of each macro-pipeline stage is shown in more detail in Fig. 5 . More specifically, each macro-pipeline stage contains an input interface, an array of N PE processing elements (PEs), a weights-and-biases memory, a control unit, and an output interface. We note that for simplicity, all weights, biases, and partial sums have a common bit-width of Q bits and saturation is used in case of an overflow. More sophisticated quantization schemes are possible, but they are beyond the scope of this work. The N PE PEs, whose internal structure is shown in Fig. 6 , can be used to compute (25) over multiple clock cycles using one of two possible schedules. In the neuron-by-neuron (NBN) schedule, neurons are processed sequentially and each of the N PE PEs computes a part of the sum in (25) for a given neuron j. In the input-by-input (IBI) schedule, the inputs of layer l (i.e., l l−1 ) are processed sequentially and the N PE PEs update the sum in (25) 
As an NBN macro-pipeline stage generates neuron output values sequentially, the optimal accelerator structure consists of an NBN macro-pipeline stage always being followed by an IBI macro-pipeline stage, allowing the IBI stage to start performing computations once the output of the first neuron of the preceding NBN stage has been computed. Once all inputs have been processed by the IBI stage, it immediately outputs multiple values to the NBN stage which follows it. Having an NBN stage after another NBN stage means that the second NBN stage would have to wait for all outputs of the previous stage to be generated before any processing can take place, and having an IBI stage followed by another IBI stage would mean that the second IBI stage cannot start processing before the first IBI stage has processed all its inputs. This structure of NBN and IBI stages connected in an alternating fashion masks a significant part of the latency and reduces the number of interconnects between two consecutive macro-pipeline stages. Since the exact architecture of each macro-pipeline stage depends on the processing schedule, we describe the details of the corresponding architectures separately in the next two sections.
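The two schedules can be illustrated with a small behavioral model. This is a software sketch only, not the hardware description: it assumes that the number of PEs divides the number of inputs (NBN) or the number of neurons (IBI), uses a ReLU activation, and counts one clock cycle per group of multiply-accumulate operations executed in parallel. All names are hypothetical.

```python
def nbn_layer(weights, biases, inputs, n_pe):
    """Neuron-by-neuron: all PEs share the input sum of one neuron at a time."""
    n_in = len(inputs)
    if n_in % n_pe != 0:
        raise ValueError("n_pe must divide the number of inputs")
    outputs, cycles = [], 0
    for j, row in enumerate(weights):
        partial = [0.0] * n_pe  # one accumulator register per PE
        for start in range(0, n_in, n_pe):  # one cycle per group of n_pe inputs
            for pe in range(n_pe):
                partial[pe] += row[start + pe] * inputs[start + pe]
            cycles += 1
        # Adder tree, bias addition, and activation in the output interface.
        outputs.append(max(0.0, sum(partial) + biases[j]))
    return outputs, cycles


def ibi_layer(weights, biases, inputs, n_pe):
    """Input-by-input: each arriving input updates the partial sums of all neurons."""
    n_out = len(weights)
    if n_out % n_pe != 0:
        raise ValueError("n_pe must divide the number of neurons")
    acc = [0.0] * n_out  # partial sums kept in the PE memories
    cycles = 0
    for i, value in enumerate(inputs):
        for start in range(0, n_out, n_pe):  # one cycle per group of n_pe neurons
            for pe in range(n_pe):
                acc[start + pe] += weights[start + pe][i] * value
            cycles += 1
    return [max(0.0, a + b) for a, b in zip(acc, biases)], cycles


# Hypothetical 4-input, 4-neuron layer processed with two PEs.
W_demo = [[1, 2, 3, 4], [0, -1, 0, 1], [2, 0, -2, 0], [1, 1, 1, 1]]
b_demo = [0.0, 1.0, -1.0, 0.5]
in_demo = [1.0, -2.0, 0.5, 3.0]
nbn_out, nbn_cycles = nbn_layer(W_demo, b_demo, in_demo, n_pe=2)
ibi_out, ibi_cycles = ibi_layer(W_demo, b_demo, in_demo, n_pe=2)
```

Both schedules produce the same layer outputs in the same number of cycles. The difference lies in when results become available: the NBN model finishes neurons one after another, whereas the IBI model can consume inputs one at a time and delivers all outputs together at the end, which is what makes the alternating arrangement attractive.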
B. Neuron-by-Neuron Macro-Pipeline Architecture
1) Input Interface: The input interface consists of $N_{\mathrm{PE}}$ multiplexers, which route each of the $NE_{l-1}$ elements of $\mathbf{l}_{l-1}$ to the correct PE.
2) Processing Elements: In the NBN schedule, each PE is only associated with a single neuron, and therefore only a single partial sum needs to be stored in each PE. Thus, the PEs are simple multiply-and-accumulate (MAC) units and the memory shown in Fig. 6 is, in fact, a single Q-bit register.
3) Control Unit: The main tasks of the control unit are to distribute the computations to the PEs and to stall the computations when no valid inputs are available or when the
Fig. 4. Example of a macro-pipeline architecture with two stages for a neural network with $N_l = 1$ hidden layer [1]. More macro-pipeline stages can be added to the pipeline to implement neural networks of arbitrary depth $N_l$.
clock cycles are required to process all neurons. When $N_{\mathrm{PE}} > NE_{l-1}$, we constrain $N_{\mathrm{PE}}$ so that $N_{\mathrm{PE}} = k \cdot NE_{l-1}$, $k \in \mathbb{N}$, and hence $k$ neurons are processed in parallel and $\frac{NE_l NE_{l-1}}{N_{\mathrm{PE}}}$ clock cycles are required to process all neurons.
4) Weight and Bias Memories: The weight and bias memories for layer $l$ are used to store $\mathbf{W}_l$ and $\mathbf{b}_l$ and they can be written externally to re-configure the NN. The weights are organized in a memory that is $N_{\mathrm{PE}} Q$ bits wide so that all PEs can be provided with data in parallel. A single word of the weight memory contains $N_{\mathrm{PE}}$ weight values corresponding to $k$ different neurons. The bias memory, on the other hand, has a bit-width of $kQ$ bits.
5) Output Interface: The output interface adds the partial sums from the $N_{\mathrm{PE}}$ PEs using an adder tree, it adds the corresponding biases, and it applies the non-linear activation function $f_l$ for each of the $k$ neurons that are being processed in parallel. A register is added between the PEs and the output interface in order to reduce the critical path of the architecture. Moreover, the output interface forwards the outputs of the $k$ neurons that are processed in parallel to the next macro-pipeline stage.
6) Latency: In the remainder of this work, we select $N_{\mathrm{PE}}$ carefully so that both $\frac{NE_{l-1}}{N_{\mathrm{PE}}}$ and $\frac{NE_l NE_{l-1}}{N_{\mathrm{PE}}}$ are always integers. With this setting, an NBN macro-pipeline stage requires clock cycles to produce all outputs of NN layer $l$. However, one full set of outputs for a NN layer is actually produced every $\frac{NE_l NE_{l-1}}{N_{\mathrm{PE}}}$ cycles, so that the throughput of the NBN macro-pipeline stage, measured in samples per clock cycle, is
$$T_l = \frac{N_{\mathrm{PE}}}{NE_l NE_{l-1}}.$$
Moreover, the first k outputs of an NBN macro-pipeline stage become available after
clock cycles. Therefore, a potential IBI macro-pipeline stage that follows can already start its computations after the L first clock cycles and that only k ≤ NE l outputs need to be forwarded to the next stage at a time.
C. Input-by-Input Macro-Pipeline Architecture
1) Input & Output Interfaces: The input and output interfaces of the IBI macro-pipeline stage are similar to those of the NBN macro-pipeline stage. The main difference is that the IBI output interface forwards the outputs of all $NE_l$ neurons that are processed in parallel to the next macro-pipeline stage.
2) Processing Elements: In the IBI schedule, each PE can be associated with multiple neurons. Therefore, several partial sums potentially need to be stored in each PE. Thus, the PEs are MAC units and the memory shown in Fig. 6 has a size of $\frac{NE_l}{N_{\mathrm{PE}}} Q$ bits.
3) Control Unit: In the IBI schedule, when $N_{\mathrm{PE}} \le NE_l$, all $N_{\mathrm{PE}}$ PEs are used to update the $NE_l$ neurons of layer $l$ sequentially with a new input value $\mathbf{l}_{l-1}[i]$ and $\frac{NE_{l-1} NE_l}{N_{\mathrm{PE}}}$ clock cycles are required to process all neurons. When $N_{\mathrm{PE}} > NE_l$, we constrain $N_{\mathrm{PE}}$ so that $N_{\mathrm{PE}} = k NE_l$, $k \in \mathbb{N}$, and $k$ inputs are processed in parallel. Hence, $\frac{NE_l NE_{l-1}}{N_{\mathrm{PE}}}$ clock cycles are required to process all neurons.
4) Weight and Bias Memories:
The weight and bias memories are similar to those of the NBN macro-pipeline stage. A single word of the weight memory contains N PE weights corresponding to k different neurons. The bias memory has a bit-width of NE l Q bits in the IBI macro-pipeline stage.
5) Latency:
Similarly to the NBN schedule, we choose $N_{\mathrm{PE}}$ carefully so that both $\frac{NE_l}{N_{\mathrm{PE}}}$ and $\frac{NE_l NE_{l-1}}{N_{\mathrm{PE}}}$ are always integers. Then, the latency and the throughput are
clock cycles and
samples per clock cycle, respectively. Moreover, all NE l outputs of an IBI macro-pipeline stage become available simultaneously after:
clock cycles.
D. Overall Neural Network Canceller Architecture
The overall NN architecture consists of $N_l + 1$ macro-pipeline stages, one for each hidden layer and one for the output layer, with pipeline registers added between them. The first hidden layer uses an NBN macro-pipeline stage and the second hidden layer (or the output layer when $N_l = 1$) uses an IBI macro-pipeline stage. Further layers use NBN and IBI macro-pipeline stages in an alternating fashion as previously discussed. The $NE_0 = 2L$ inputs $\mathbf{l}_0$ of the first NBN macro-pipeline stage that implements the computations of the first hidden layer are assumed to all be available in parallel. The number of PEs instantiated for layer $l$ is denoted by $N_{\mathrm{PE},l}$. The computations for the linear canceller are done in parallel with the NN by instantiating a standard complex FIR filter with $N_{\mathrm{PE,linear}}$ complex-valued PEs. The latency of the linear canceller in clock cycles
Since the linear canceller is not pipelined, it holds that $T_{\mathrm{linear}} = \frac{1}{L_{\mathrm{linear}}}$. The throughput of the overall NN canceller architecture is:
$$T = \min\left(T_{\mathrm{linear}},\, \min_{l} T_l\right).$$
Since it is typically not very costly in terms of resources to ensure that $T_{\mathrm{linear}} \ge T_l$, $l \in \{1, \ldots, N_l + 1\}$, in practice $T$ is usually limited by $\min_l T_l$. As opposed to the throughput, the latency of the overall NN canceller is more complicated to derive in general. However, in the special case where the number of PEs for each layer $l$ is chosen such that no stalling happens and $N_l + 1$ is even, the latency can be calculated as:
Finally, we note that the denormalization step shown in Fig. 3 is constrained to scaling with powers of two, which can be implemented efficiently with simple shifting operations, both during training and during inference.
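As a small illustration of shift-based denormalization, the sketch below rounds a scale factor to the nearest power of two in the logarithmic domain and applies it to a fixed-point integer by shifting. The rounding rule and the function names are assumptions; Python's right shift on negative integers is arithmetic, i.e., it rounds towards negative infinity.

```python
import math


def nearest_power_of_two_exponent(scale):
    """Exponent e such that 2**e is the closest power of two to scale (in log2 terms)."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return round(math.log2(scale))


def scale_by_shift(value_fixed, exponent):
    """Multiply a fixed-point integer by 2**exponent using only shifts."""
    if exponent >= 0:
        return value_fixed << exponent
    return value_fixed >> -exponent


# Hypothetical standard deviation of the training targets.
exp_demo = nearest_power_of_two_exponent(0.23)
scaled_demo = scale_by_shift(40, exp_demo)
```

Here the hypothetical scale 0.23 is approximated by 2⁻², so the integer 40 is mapped to 10.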
V. POLYNOMIAL CANCELLER HARDWARE ARCHITECTURE
Since, to the best of our knowledge, there are no published implementations of polynomial SI cancellers in the literature, we provide our own optimized reference implementation. Our polynomial SI canceller architecture, which is shown in Fig. 7 , is largely based on the NN architecture since the main computational tasks of the two cancellers are very similar (i.e., computation of weighted sums). The main differences are that the input interface also computes the basis functions, that N CPE complex PEs (CPEs) are used to perform computations on complex values, and that there is only a single macro-pipeline stage. In the remainder of this section, we explain how the basis functions can be computed efficiently and we describe the polynomial SI canceller in more detail.
A. Basis Function Computation
The computation of the $N_{\mathrm{BF}}$ basis functions in (5) for each cancellation sample seems like a cumbersome task. Fortunately, we can show that the basis functions have a number of properties that enable their efficient computation. First, significant basis function re-use is possible. In particular, after $\hat{y}_{\mathrm{SI}}[n-1]$ has been computed based on $\mathrm{BF}_{p,q}(x[n-1-l])$, $l \in \{0, \ldots, L-1\}$, $p \in \{1, 3, \ldots, P\}$, $q \in \{0, \ldots, p\}$, the basis functions for $l \in \{0, \ldots, L-2\}$ can be stored and re-used for the computation of $\hat{y}_{\mathrm{SI}}[n]$. As such, the only new basis functions that need to be computed for $\hat{y}_{\mathrm{SI}}[n]$ are $\mathrm{BF}_{p,q}(x[n])$, $p \in \{1, 3, \ldots, P\}$, $q \in \{0, \ldots, p\}$. This requires $\frac{L-1}{4}(P+1)(P+3)$ memory elements, but reduces the number of basis functions that need to be computed by a factor of $L$ from $\frac{L}{4}(P+1)(P+3)$ to $\frac{1}{4}(P+1)(P+3)$. Moreover, the following proposition shows two additional properties of the basis functions.
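Proposition 1: For the basis functions in (5), it holds that:
1) $\mathrm{BF}_{p,q}(x) = \left(\mathrm{BF}_{p,p-q}(x)\right)^*$
2) $\mathrm{BF}_{p,q}(x) = x^2\, \mathrm{BF}_{p-2,q-2}(x)$
Proof: Both properties follow from the definition of the basis function in (5). Specifically, for 1) we have:
$$\mathrm{BF}_{p,q}(x) = x^{q} (x^*)^{p-q} = \left(x^{p-q} (x^*)^{q}\right)^* = \left(\mathrm{BF}_{p,p-q}(x)\right)^*,$$
and for 2) we have:
$$\mathrm{BF}_{p,q}(x) = x^{q} (x^*)^{p-q} = x^2\, x^{q-2} (x^*)^{(p-2)-(q-2)} = x^2\, \mathrm{BF}_{p-2,q-2}(x).$$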
Property 1) enables a computation reduction by a factor of two since for every $p \in \{1, 3, \ldots, P\}$, it is sufficient to compute $\mathrm{BF}_{p,q}(x)$ only for $q \in \left\{\frac{p+1}{2}, \ldots, p\right\}$ and the remaining basis functions for $q \in \left\{0, \ldots, \frac{p-1}{2}\right\}$ can be obtained by simple conjugation. Moreover, property 2) reveals an efficient dynamic programming (DP) method to compute the basis functions for $x[n]$, which is shown in Algorithm 1. Algorithm 1 requires one multiplication to pre-compute $(x[n])^2$ and $\frac{1}{8}(P+1)(P+3) - 2$ multiplications for all executions of line 7. The conjugation in line 8 does not require any multiplications as it is a simple sign change of the imaginary part of $\mathrm{BF}_{p,p-q}(x)$. As such, the total number of multiplications to compute the basis functions for a baseband sample $x[n]$ is:
$$N_{\mathrm{MUL,BF}} = \frac{1}{8}(P+1)(P+3) - 1.$$
One downside of the DP approach is that only the inner loop in Algorithm 1 can be parallelized. However, in most typical applications we have P ≤ 9, so that the outer loop in Algorithm 1 is executed very few times. We note that, due to the efficiency of Algorithm 1, N MUL,BF is significantly smaller than N MUL,poly , which justifies ignoring the multiplications of the basis function computations in (11) for simplicity.
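The DP idea can be sketched as follows. This is not a transcription of Algorithm 1, whose listing is not reproduced here; it is a minimal illustration of how properties 1) and 2) of Proposition 1 can be combined, and its exact multiplication count may differ slightly from the one stated in the text. The function name and the dictionary layout are hypothetical.

```python
def basis_functions_dp(x, max_order):
    """Return {(p, q): x**q * conj(x)**(p - q)} for odd p <= max_order and 0 <= q <= p."""
    x_sq = x * x  # pre-computed once and re-used for every order
    bf = {(1, 1): x, (1, 0): x.conjugate()}
    for p in range(3, max_order + 1, 2):
        # Upper half, q = (p + 1) / 2, ..., p: one multiplication each (property 2).
        for q in range((p + 1) // 2, p + 1):
            bf[(p, q)] = x_sq * bf[(p - 2, q - 2)]
        # Lower half, q = 0, ..., (p - 1) / 2: conjugation only (property 1).
        for q in range((p + 1) // 2, p + 1):
            bf[(p, p - q)] = bf[(p, q)].conjugate()
    return bf


# Hypothetical baseband sample and maximum order.
x_sample = 0.3 - 0.7j
bf_table = basis_functions_dp(x_sample, 7)
```

The inner loops for a given order are independent of each other and can run in parallel, whereas the outer loop over the orders is inherently sequential because each order depends on the previous one.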
B. Polynomial Canceller Architecture
We use a high-level structure that is similar to the NN-based cancellers in Fig. 3 in the sense that linear cancellation is done in parallel to non-linear cancellation and the polynomial SI canceller focuses only on the non-linear part of the SI signal. Since most of the SI signal is linear, removing the linear term separately significantly reduces the dynamic range of the values within the polynomial SI canceller, which in turn allows us to reduce the common quantization bit-width Q for the real and the imaginary parts of the involved quantities.
1) Input & Output Interfaces: The input interface consists of $N_{CPE}$ multiplexers, which route each of the $N_{BF}$ BFs to the correct CPE in order to compute parts of the sum in (7). As mentioned previously, the input interface also computes the BFs using $N_{CPE,BF}$ CPEs. Since only the inner loop in Algorithm 1 can be parallelized, it is reasonable to constrain $N_{CPE,BF}$ so that $N_{CPE,BF} \le \frac{P+1}{2}$. The number of clock cycles required to compute all BFs with $N_{CPE,BF}$ CPEs is:
where one clock cycle is used to pre-compute $x^2$, and the result and $x^*$ are stored in two 2Q-bit registers. The output interface consists of an adder tree that adds up the partial sums stored in the $N_{CPE}$ CPEs in order to produce the final result.
2) Complex Processing Elements: The $N_{CPE}$ CPEs are complex MAC units with a Q-bit register to store partial sums. The complex MAC units are implemented using three real-valued multipliers and five real-valued adders.
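One well-known way to perform a complex multiplication with three real multiplications is Gauss's trick, sketched below in Python. This is only an illustration of the general technique: the text does not state which decomposition the CPEs use, the function name `complex_mac` is hypothetical, and the sketch makes no attempt to reproduce the adder count quoted above.

```python
def complex_mac(acc: complex, a: complex, b: complex) -> complex:
    """Return acc + a * b using three real multiplications (Gauss's trick)."""
    ar, ai = a.real, a.imag
    br, bi = b.real, b.imag

    # Three real multiplications instead of the usual four.
    k1 = br * (ar + ai)
    k2 = ar * (bi - br)
    k3 = ai * (br + bi)

    prod_re = k1 - k3  # equals ar*br - ai*bi
    prod_im = k1 + k2  # equals ar*bi + ai*br

    # Accumulate real and imaginary parts separately.
    return complex(acc.real + prod_re, acc.imag + prod_im)
```

If one operand is a stored parameter, the terms `bi - br` and `br + bi` depend only on that parameter and could in principle be pre-computed, which would reduce the number of run-time additions.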
3) Control Unit: Similarly to the NN-based canceller, the main tasks of the control unit are to distribute the computations to the CPEs and to stall the computations when no valid inputs are available. The control unit schedules the operation so that the CPEs first compute the terms of (7) that are based on BFs that are already available in the circular buffer. In the meantime, the input interface computes the $\frac{1}{4}(P+1)(P+3)$ BFs that depend on the new sample $x[n]$.
4) Parameter Memory: The parameter memory is used to store the complex-valued $\hat{h}_{p,q}$ parameters of the polynomial canceller. The memory contains $N_{BF}/N_{CPE}$ words that are $2QN_{CPE}$ bits wide so that all $N_{CPE}$ CPEs can be provided with the parameters in parallel.
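As a small worked example of this memory organization, the Python sketch below evaluates the word count and word width. The helper name `parameter_memory_shape` is hypothetical, and the example values (156 BFs, 12 CPEs, Q = 20) are the ones quoted for the polynomial canceller in the implementation results.

```python
from typing import Tuple


def parameter_memory_shape(n_bf: int, n_cpe: int, q_bits: int) -> Tuple[int, int]:
    """Return (number of words, word width in bits) of the parameter memory."""
    if n_bf % n_cpe != 0:
        raise ValueError("this sketch assumes N_BF is a multiple of N_CPE")
    words = n_bf // n_cpe              # one word per clock cycle of (7)
    width_bits = 2 * q_bits * n_cpe    # real + imaginary part for every CPE
    return words, width_bits


words, width_bits = parameter_memory_shape(n_bf=156, n_cpe=12, q_bits=20)
total_bits = words * width_bits  # equals 156 complex parameters of 2*20 bits
```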
5) Latency: In most practical cases, the latency of computing the new BFs is significantly smaller than the latency of computing the terms of (7), so that the latency of this operation is masked entirely and can be ignored. For example, for P = 7 and $N_{CPE,BF} = 1$, according to (38) it takes only 10 clock cycles to compute the new BFs. Setting $N_{CPE,BF} = 2$ reduces the number of required cycles to 6. As such, we can safely assume that the latency of the polynomial canceller is limited by the computation of (7). The latency of the polynomial SI canceller is given by:
where one clock cycle is required by the adder tree in the output interface to produce the final output. Since a pipeline register is inserted before the adder tree of the output interface, the throughput of the polynomial SI canceller, measured in samples per clock cycle, is given by $T_{poly} = 1/L_{poly}$.
VI. NUMERICAL AND HARDWARE IMPLEMENTATION RESULTS
In this section, we compare the polynomial SI canceller with the NN-based SI cancellers in terms of their SI cancellation performance and their hardware implementation complexity. Specifically, we first provide a performance comparison of the polynomial SI canceller with the shallow NN SI canceller of [22] as well as with an instance of a deep NN SI canceller of [24]. We then present a comparison of FPGA and ASIC implementation results for the polynomial SI canceller and the NN-based SI cancellers in an equi-performance scenario.
A. Self-Interference Cancellation Performance Comparison
1) Comparison Setup: The complexity expressions for the polynomial SI canceller in (10)-(11) and the NN SI cancellers in (23)-(24) cannot be compared directly because they contain different sets of parameters. Thus, in order to perform a fair complexity comparison, we select values for $L$, $P$, $N_l$, and $N_h$ so that the SI cancellation performance of the compared polynomial and NN cancellers is as similar as possible.
For the performance evaluation, we use the dataset that was used in [22], which consists of a 10 MHz QPSK-modulated OFDM SI signal sampled at 20 MHz that is generated using an actual full-duplex testbed with a transmit power of 10 dBm.³ The dataset contains 20,000 time-domain SI baseband samples, out of which 90% are used for training and 10% for the evaluation of the SI cancellation performance. For NN training, we use a mini-batch size of B = 32 and the Adam optimizer [32] with a learning rate of λ = 0.004.
2) Results: In Fig. 8, we show the power spectral density (PSD) of the received SI signal $y_{SI}[n]$ before any SI cancellation is performed, the PSD of the received signal when no transmission takes place (i.e., the effective noise floor of the receiver), as well as the PSDs of the SI signals after linear SI cancellation and after non-linear SI cancellation with the polynomial and NN-based cancellers. The results are obtained for L = 13 for all SI cancellers and P = 5 for the polynomial SI canceller. Moreover, the shallow NN has $N_l = 1$ hidden layer with $N_h = 18$ neurons, while the deep NN has $N_l = 5$ hidden layers with $N_h = 6$ neurons per layer. These parameter values are chosen as follows. First, we select L by slowly increasing its value until no further linear cancellation gains are obtained. We then use the polynomial SI canceller and we increase P until the gain in SI cancellation performance becomes very small. When going from P = 5 to P = 7, the SI cancellation only improves by 0.3 dB while the computational complexity almost doubles, so that P = 5 provides a sensible complexity-performance trade-off. We then use the same value of L for the NN-based SI cancellers and we select $N_l$ and $N_h$ to match the performance of the polynomial SI canceller. We observe that, with these parameter settings, all SI cancellers indeed achieve very similar performance and can cancel the SI very close to the receiver noise floor, with the polynomial canceller being 0.4 dB and 0.5 dB better than the shallow and the deep NN-based cancellers, respectively. We note that, in light of our recent results in [24, Fig. 5], using P = 5 for the polynomial SI canceller results in a fairer comparison than using P = 7 as was done in [22]. However, as we show in the sequel, even in this case there are very clear advantages in terms of the implementation complexity when using an NN-based canceller.
³ The dataset is available at https://github.com/abalatsoukas/fdnn.
Fig. 9. Training convergence of shallow ($N_l = 1$) and deep ($N_l = 5$) NN-based SI cancellers with $N_h = 18$ and $N_h = 6$ neurons per layer, respectively. (Horizontal axis: training epoch; vertical axis: non-linear SI cancellation in dB; training and test curves are shown for each canceller.)
In Fig. 9 , we show the training convergence behavior of the shallow and deep NN SI cancellers. We observe that the shallow and deep NNs require only 7 and 12 training epochs to achieve more than 5 dB of non-linear cancellation, respectively. Training for up to 20 epochs further increases the cancellation performance to approximately 5.5 dB, but there are clear diminishing returns. Moreover, we observe that the cancellers have similar performance on the training and test samples, meaning that there are no overfitting issues. We note, however, that during our experiments we observed that the deep NN is more sensitive to the weight initialization.
In Table I, we show the complexity of the three SI cancellers in terms of the number of real-valued multiplications and additions given by (10)-(11) and (23)-(24). We observe that the polynomial SI canceller requires the largest number of additions, while the shallow NN SI canceller requires the largest number of multiplications. The deep NN SI canceller, on the other hand, achieves a significant complexity reduction as it requires 25% fewer multiplications and 60% fewer additions than the polynomial SI canceller.
B. FPGA and ASIC Implementation Results

1) Comparison Setup:
In Section VI-A1, we already showed that using L = 13 for all cancellers, P = 5 for the polynomial canceller, a single hidden layer with $N_h = 18$ neurons for the shallow NN canceller, and $N_l = 5$ hidden layers with $N_h = 6$ neurons per layer for the deep NN canceller leads to practically identical SI cancellation performance. However, in order to perform a meaningful comparison of FPGA and ASIC implementation results, the quantization bit-width Q also needs to be selected individually for each canceller so that the implementation complexity is minimized while the performance of the SI cancellers remains similar. In Fig. 10, we show the cancellation performance for the polynomial SI canceller and the NN SI cancellers as a function of the quantization bit-width Q. We observe that both NN SI cancellers generally require a lower quantization bit-width Q compared to the polynomial SI canceller to achieve the same SI cancellation performance. For the hardware implementation results presented in this section, we choose Q = 17 for the shallow NN SI canceller, Q = 19 for the deep NN SI canceller, and Q = 20 for the polynomial SI canceller, as this choice leads to effectively identical SI cancellation performance and a very small loss with respect to the corresponding floating-point implementations for all cancellers. The deep NN SI canceller requires two additional integer bits compared to the shallow NN SI canceller, due to larger absolute output values in the hidden layers. This effectively shifts the bit-width versus SI cancellation performance curve of the deep NN SI canceller by two bits to the right compared to the shallow NN SI canceller.
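The trade-off between integer bits and resolution at a fixed bit-width can be illustrated with a minimal fixed-point quantizer in Python. The number format is an assumption: the text does not specify it, so the sketch uses signed two's-complement fixed point with the sign bit counted among the integer bits, round-to-nearest, and saturation. The function name `quantize` is hypothetical, and a complex value would be quantized by applying it to the real and imaginary parts separately.

```python
def quantize(value: float, total_bits: int, int_bits: int) -> float:
    """Quantize a real value to signed fixed point with saturation.

    total_bits: word length Q; int_bits: integer bits including the sign bit.
    """
    frac_bits = total_bits - int_bits
    scale = 1 << frac_bits                    # 2**frac_bits steps per unit
    lowest = -(1 << (total_bits - 1))         # most negative integer code
    highest = (1 << (total_bits - 1)) - 1     # most positive integer code
    code = round(value * scale)               # round to the nearest code
    code = max(lowest, min(highest, code))    # saturate on overflow
    return code / scale
```

With 8 bits in total, moving from 2 to 4 integer bits lets the value 3.0 be represented without saturation, but the step size grows from 1/64 to 1/16. Recovering the original resolution then requires two extra bits, which matches the two-bit shift described above.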
For the shallow NN SI canceller, we set $N_{PE,1} = 52$ and $N_{PE,2} = 4$ so that $T_1 = T_2 = 1/9$. With this setting, the macro-pipeline is perfectly balanced and one cancellation sample is produced every 9 clock cycles. Furthermore, $N_{CPE,linear} = 2$ CPEs are instantiated for the NN SI canceller to ensure that the linear cancellation step can be completed in the same number of cycles. For the deep NN SI canceller, we set $N_{PE,1} = 26$ for the first hidden layer, $N_{PE,l} = 6$ for the remaining hidden layers, and $N_{PE,6} = 2$ for the output layer so that $T_l = 1/6$ for all $l \in \{1, \ldots, 6\}$ and the macro-pipeline is balanced. The deep NN SI canceller uses $N_{CPE,linear} = 3$ CPEs for the linear canceller due to the increased throughput requirements. The shallow NN SI canceller thus requires a total of 56 PEs and the deep NN SI canceller requires a total of 52 PEs, but the deep NN SI canceller requires one more CPE than the shallow NN SI canceller. Finally, for the polynomial canceller, we use $N_{CPE} = 12$ complex PEs so that the 156 complex multiplications required to compute (7) for L = 13 and P = 5 can be carried out in $L_{poly} = 13$ clock cycles. We also use $N_{CPE,BF} = 1$ CPE for the BF computation, meaning that $L_{BF} = 10$. Since $L_{poly} > L_{BF}$, one cancellation sample is produced every $L_{poly} = 13$ clock cycles.
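The PE counts above can be cross-checked with a few lines of Python. The sketch rests on assumptions that are not stated in this section: each PE performs one real-valued MAC per clock cycle, a layer with n_in inputs and n_out outputs needs n_in * n_out MACs (biases are ignored), the NN input consists of 2L = 26 real values, and the output layer has 2 neurons. The helper names are hypothetical.

```python
from typing import List


def pes_for_layer(n_in: int, n_out: int, cycles: int) -> int:
    """PEs needed to finish n_in * n_out MACs within the given cycle budget."""
    macs = n_in * n_out
    return -(-macs // cycles)  # integer ceiling division


def pes_for_network(layer_sizes: List[int], cycles: int) -> List[int]:
    """PE count per layer for a balanced macro-pipeline."""
    return [pes_for_layer(a, b, cycles)
            for a, b in zip(layer_sizes, layer_sizes[1:])]


L, P = 13, 5
shallow = pes_for_network([2 * L, 18, 2], cycles=9)
deep = pes_for_network([2 * L, 6, 6, 6, 6, 6, 2], cycles=6)

# Polynomial canceller: L taps times (P + 1)(P + 3) / 4 BFs per tap.
n_bf = L * (P + 1) * (P + 3) // 4
poly_cycles = -(-n_bf // 12)  # with N_CPE = 12 complex PEs
```

Under these assumptions the sketch gives 52 and 4 PEs for the shallow NN, 26, 6, 6, 6, 6, and 2 PEs for the deep NN, and 156 complex multiplications over 13 cycles for the polynomial canceller, in agreement with the values in the text.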
2) FPGA Implementation Results: In Table II, we show place-and-route (PAR) results on a Xilinx Virtex-7 XC7VX485 (speed grade -2) FPGA, which contains a total of 75.9k slices, 303.6k LUTs, 607.2k flip-flops, and 2.8k DSP slices. A clock frequency target of 100 MHz is used for all cancellers.
We observe that the shallow NN SI canceller has a lower slice utilization and uses fewer LUTs as logic than the polynomial SI canceller, and that it has a 45% higher throughput. The higher throughput of the shallow NN SI canceller comes both from a lower number of cycles per sample and from a slightly higher operating frequency compared to the polynomial SI canceller. We also note that the polynomial SI canceller requires approximately 32% fewer DSP slices than both the shallow and the deep NN SI canceller. The deep NN SI canceller uses more resources than the shallow NN SI canceller, but it has a 77% higher throughput than the polynomial SI canceller. The main additional cost for the deep NN SI canceller compared to the shallow NN SI canceller comes from registers and LUTs used as logic. Even though the deep NN SI canceller is not able to achieve a clock frequency as high as the other cancellers, it still has the highest throughput. We also observe that the shallow NN SI canceller has the lowest latency, followed by the polynomial SI canceller and the deep NN SI canceller.
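The two contributions to the throughput gain can be separated with a short Python sketch. It assumes that throughput equals the clock frequency divided by the number of cycles per sample, with 13, 9, and 6 cycles per sample taken from the comparison setup. The clock ratios it returns are implied values derived from the reported throughput gains, not measured frequencies, and the function names are hypothetical.

```python
CYCLES_PER_SAMPLE = {"polynomial": 13, "shallow_nn": 9, "deep_nn": 6}


def speedup_at_equal_clock(name: str, baseline: str = "polynomial") -> float:
    """Throughput ratio that would result from the cycle counts alone."""
    return CYCLES_PER_SAMPLE[baseline] / CYCLES_PER_SAMPLE[name]


def implied_clock_ratio(reported_speedup: float, name: str,
                        baseline: str = "polynomial") -> float:
    """Clock-frequency ratio needed to explain a reported throughput ratio."""
    return reported_speedup / speedup_at_equal_clock(name, baseline)


# 45% and 77% higher throughput correspond to ratios of 1.45 and 1.77.
shallow_clock = implied_clock_ratio(1.45, "shallow_nn")
deep_clock = implied_clock_ratio(1.77, "deep_nn")
```

The cycle counts alone give ratios of about 1.44 and 2.17. The implied clock ratio is therefore slightly above one for the shallow NN and roughly 0.82 for the deep NN, which is consistent with the qualitative statements about the operating frequencies.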
3) ASIC Implementation Results: In Table III, we present ASIC implementation results for the polynomial SI canceller and the two NN SI cancellers using a 28 nm FD-SOI technology. We target two different points in terms of the operating frequency, namely, a maximum throughput point and a point where each SI canceller achieves a throughput of exactly 20 MS/s, which is sufficient for the dataset that we consider. In both cases, we use slow-slow corners, an operating voltage of 0.7 V, and an operating temperature of 125 °C. For the power results, post-PAR simulations are used both to verify the design and to accurately estimate the switching activity.
We observe that, for the maximum throughput operating point, the NN SI cancellers are both significantly faster and more energy efficient than the polynomial SI canceller. However, the polynomial SI canceller is generally smaller and has a lower power consumption. More specifically, the shallow NN SI canceller is 33% faster but also 30% larger than the polynomial SI canceller. As a result, its area efficiency is 7% lower than that of the polynomial SI canceller. The deep NN SI canceller, on the other hand, is only 4% larger than the polynomial SI canceller and at the same time 115% faster and has a 106% better area efficiency. The energy efficiency of the NN SI cancellers is also significantly better compared to the polynomial SI canceller. Specifically, the shallow and deep NN SI cancellers improve the energy efficiency by 6% and 50% compared to the polynomial SI canceller, respectively. Finally, we observe that the deep NN SI canceller has the worst latency (60 ns), followed by the polynomial SI canceller (37 ns) and the shallow NN canceller (31 ns).
As mentioned previously, for the dataset that we use in this work, a throughput of 20 MS/s is sufficient for real-time operation. We observe that the relaxed timing requirements, in this case, reduce the area of the polynomial SI canceller by 13%, the shallow NN SI canceller by 7% and the deep NN SI canceller by 13% compared to the results for the maximum throughput operating point. Interestingly, the energy efficiency per sample only improves for the polynomial canceller, whereas it becomes slightly worse for the two NN SI cancellers. Nevertheless, the deep NN SI canceller is still 30% more energy efficient than the polynomial SI canceller and only 4% less area efficient. Moreover, at this operating point, the polynomial canceller has the lowest latency (48 ns) and the deep NN canceller has the lowest power consumption.
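To see why the timing requirements are relaxed at this operating point, the required clock frequency can be estimated in Python. The sketch assumes that the ASIC designs need the same 13, 9, and 6 clock cycles per sample as in the comparison setup; the resulting frequencies are estimates under that assumption rather than reported values, and the function name is hypothetical.

```python
def required_clock_hz(samples_per_second: int, cycles_per_sample: int) -> int:
    """Clock frequency needed to sustain a target throughput."""
    return samples_per_second * cycles_per_sample


TARGET = 20_000_000  # 20 MS/s

poly_clock = required_clock_hz(TARGET, 13)     # polynomial canceller
shallow_clock = required_clock_hz(TARGET, 9)   # shallow NN canceller
deep_clock = required_clock_hz(TARGET, 6)      # deep NN canceller
```

Under this assumption the polynomial, shallow NN, and deep NN cancellers would need clocks of about 260 MHz, 180 MHz, and 120 MHz, respectively.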
VII. CONCLUSION
In this paper, we presented a high-throughput hardware architecture for an NN-based SI cancellation scheme for full-duplex radios. We also presented, to the best of our knowledge, the first efficient hardware architecture for polynomial SI cancellation in the literature, which we used as a comparison baseline for the NN-based SI cancellers. Our implementation results show that the NN SI cancellers have significantly lower computational complexity than a conventional polynomial SI canceller, which translates into substantial area and energy savings when the schemes are implemented in hardware. Specifically, an ASIC implementation of a deep NN-based SI canceller has up to 2.1× and 2× better hardware efficiency and energy efficiency when compared to a conventional polynomial SI canceller, respectively.
