Abstract-In-band full-duplex systems transmit and receive information simultaneously on the same frequency band. Due to the strong self-interference caused by the transmitter to its own receiver, non-linear digital self-interference cancellation is essential. In this work, we present a hardware architecture for a neural network based non-linear self-interference canceller and compare it with our own hardware implementation of a conventional polynomial based canceller. We show that, for the same cancellation performance, the neural network canceller has a significantly higher throughput and requires fewer hardware resources.
I. INTRODUCTION
In-band full-duplex (FD) communication has long been considered impractical due to the strong self-interference (SI) caused by the transmitter to its own receiver. However, more recent work on the topic (e.g., [1], [2], [3]) has demonstrated that it is in fact possible to achieve sufficient SI cancellation to make FD systems viable. A portion of the SI is usually first removed in the analog RF domain. As analog cancellation alone is often not sufficient, the residual SI needs to be cancelled in the digital domain. In principle, the residual SI should be easy to cancel since it is produced by a known transmitted signal. In practice, however, the different stages of the transceiver introduce non-linearities into the signal, such as digital-to-analog converter (DAC) and analog-to-digital converter (ADC) non-linearities, IQ imbalance, and power amplifier (PA) non-linearities. Intricate memory polynomial models have to be used for the digital SI cancellation to handle these non-linearities (e.g., [4], [5], [6], [7], [8]). An alternative solution, which uses a neural network (NN) to reconstruct the non-linearities in order to generate the SI cancellation signal, was recently proposed in [9], where it was shown to achieve SI cancellation performance similar to that of the state-of-the-art polynomial model of [8], but with much lower computational complexity.
Existing NN hardware accelerators, such as [10], [11], mainly target applications where both the size of the NN and the number of inputs are very large, and where producing a few tens of outputs per second is sufficient. Communications applications, on the other hand, use relatively small NNs with few inputs, but need to provide millions of outputs per second. As such, communications applications require vastly different NN hardware accelerator architectures.
Contribution:
In this work, we present a hardware implementation of the SI cancellation method proposed in [9] in order to quantify the computational complexity gains over the state-of-the-art polynomial based model of [8] and translate them into real-world hardware resource utilization gains. We provide FPGA and ASIC implementation results that clearly demonstrate the significant gains that can be achieved by our proposed NN-based canceller in terms of both resource utilization and throughput. To the best of our knowledge, this is the first hardware implementation of a NN-augmented communications system in the literature related to the recent resurgence of machine learning for communications.
II. DIGITAL SELF-INTERFERENCE CANCELLATION
A basic block diagram of a full-duplex wireless transceiver is shown in Fig. 1. If we assume, for simplicity, that there is no signal-of-interest from a remote node and no thermal noise, then the received signal $y(n)$ is the SI signal. The goal of digital SI cancellation is to reproduce an accurate copy of $y(n)$, denoted by $\hat{y}(n)$, based on the transmitted baseband signal $x(n)$. This signal is then subtracted from $y(n)$, so that the residual SI signal is $y_c(n) = y(n) - \hat{y}(n)$. If $\hat{y}(n)$ is reconstructed perfectly, then the SI can be cancelled entirely and $y_c(n) = 0$. In practice, however, due to the presence of thermal noise and transceiver non-linearities, perfect SI cancellation is difficult to achieve.
1) Polynomial Non-Linear Cancellation:
Fig. 2. Example of a neural network for the reconstruction of the non-linear component of the SI signal [9].
A state-of-the-art polynomial SI cancellation model, which can effectively suppress IQ imbalance and PA non-linearities, was described in [8]. Specifically, it was shown that an accurate SI cancellation signal $\hat{y}(n)$ can be obtained as:
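$$\hat{y}(n) = \sum_{\substack{p=1 \\ p\ \mathrm{odd}}}^{P} \sum_{q=0}^{p} \sum_{m=0}^{L-1} \hat{h}_{p,q}(m)\, x(n-m)^{q}\, x^{*}(n-m)^{p-q}, \tag{1}$$
where $x(n)$ is the transmitted digital baseband signal, $L$ corresponds to the overall memory of the system, $P$ is the non-linearity order, and $\hat{h}_{p,q}$ are estimated parameters that can be obtained using, e.g., least-squares estimation.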
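As an illustration, the following Python sketch evaluates a memory polynomial of this kind for a single output sample. It assumes the widely used form in which, for every odd order p up to P and every q from 0 to p, the basis function x(n-m)^q x*(n-m)^(p-q) is weighted by a coefficient for each of the L taps. The function names and the coefficient layout (a dictionary keyed by `(p, q, m)`) are hypothetical and are not taken from [8].

```python
def basis_indices(P):
    """Return all (p, q) pairs with p odd, 1 <= p <= P, and 0 <= q <= p."""
    return [(p, q) for p in range(1, P + 1, 2) for q in range(p + 1)]


def poly_cancel_sample(x, n, h, L, P):
    """Evaluate the assumed memory polynomial for output sample n.

    x: list of complex baseband samples (samples before index 0 are taken as 0)
    h: dict mapping (p, q, m) to a complex coefficient (hypothetical layout)
    """
    y_hat = 0j
    for m in range(L):
        xm = x[n - m] if n - m >= 0 else 0j
        for p, q in basis_indices(P):
            # Basis function x(n-m)^q * conj(x(n-m))^(p-q)
            basis = (xm ** q) * (xm.conjugate() ** (p - q))
            y_hat += h[(p, q, m)] * basis
    return y_hat


# Number of weighted complex terms per output sample for L = 13 and P = 7.
terms = 13 * len(basis_indices(7))
```

Under this assumed form, P = 7 gives 20 basis functions per tap, so L = 13 yields 260 weighted terms per sample, which agrees with the 260 complex multiplications quoted in Section IV.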
2) Neural Network Non-Linear Cancellation:
The NN-based method of [9] uses two steps, as illustrated in Fig. 3. First, standard linear cancellation is used to reconstruct the linear component of the SI, denoted by $\hat{y}_{lin}(n)$:
$$\hat{y}_{lin}(n) = \sum_{m=0}^{L-1} \hat{h}(m)\, x(n-m), \tag{2}$$
where $\hat{h}$ are estimated parameters that can be obtained using, e.g., least-squares estimation. A two-layer real-valued neural network, shown in Fig. 2, generates the non-linear part of the SI cancellation signal, denoted by $\hat{y}_{nn}(n)$. Finally, the two components are added to create the SI cancellation signal $\hat{y}(n) = \hat{y}_{lin}(n) + \hat{y}_{nn}(n)$. The denormalization step in Fig. 3 is necessary because the NN learns to reproduce a normalized (i.e., zero-mean and unit-variance) version of $\hat{y}_{nn}$, as this generally improves the convergence of NN training.
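The two-step structure can be sketched in Python as follows. This is a minimal illustration, not the implementation of [9]: it assumes that the NN inputs are the real and imaginary parts of the L most recent transmitted samples, that the hidden layer uses a ReLU, that the output layer is linear with two outputs (real and imaginary part), and that denormalization is a plain scaling. All argument names are hypothetical.

```python
def relu(v):
    return v if v > 0.0 else 0.0


def linear_cancel(window, h_lin):
    """Linear SI estimate: sum over m of h_lin[m] * x(n - m)."""
    return sum(h * x for h, x in zip(h_lin, window))


def nn_cancel(window, W_h, b_h, W_o, b_o):
    """Two-layer real-valued network on the 2L real inputs (Re and Im parts)."""
    inputs = [x.real for x in window] + [x.imag for x in window]
    hidden = [relu(sum(w * v for w, v in zip(row, inputs)) + b)
              for row, b in zip(W_h, b_h)]
    out = [sum(w * v for w, v in zip(row, hidden)) + b
           for row, b in zip(W_o, b_o)]
    return complex(out[0], out[1])


def cancel_sample(window, h_lin, W_h, b_h, W_o, b_o, scale=1.0):
    """Total estimate: linear part plus the denormalized (scaled) NN output."""
    nn_part = scale * nn_cancel(window, W_h, b_h, W_o, b_o)
    return linear_cancel(window, h_lin) + nn_part
```

Here `window` holds x(n), x(n-1), ..., x(n-L+1), and each row of `W_h` and `W_o` contains the weights of one neuron.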
3) Computational Complexity:
Assuming that each complex multiplication can be implemented using three real multiplications and five real additions and that each complex addition can be implemented using two real additions, the total number of real multiplications and additions that are required by the polynomial canceller is [9]:
The number of real multiplications and additions that are required by the NN canceller is [9]:
where the second term in both expressions comes from the linear canceller. The complexity expressions for the two methods cannot be compared directly because they contain different sets of parameters. In order to perform a fair comparison, in Section IV we select values for $L$, $P$, and $N_h$ so that the two methods have the same SI cancellation performance.
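The operation count assumed for a complex multiplication (three real multiplications and five real additions) can be met by a well-known three-multiplier decomposition. The sketch below shows one such decomposition in Python. The text does not state which variant is assumed, so this is only an example that matches the stated count, and the function name is hypothetical.

```python
def complex_mul_3m(a, b, c, d):
    """Multiply (a + jb) by (c + jd) with 3 real multiplications and 5 real additions."""
    k1 = c * (a + b)          # 1 addition, 1 multiplication
    k2 = a * (d - c)          # 1 addition, 1 multiplication
    k3 = b * (c + d)          # 1 addition, 1 multiplication
    return k1 - k3, k1 + k2   # 2 additions: (real part, imaginary part)
```

Expanding the terms gives k1 - k3 = ac - bd and k1 + k2 = ad + bc, which are the real and imaginary parts of the product.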
III. HARDWARE ARCHITECTURE
In this section, we describe a hardware architecture that implements the NN-based SI canceller of [9] . We first give a global overview of the architecture, which is followed by a more detailed explanation of each component. As shown in Fig. 4 , we map each layer of the NN to a macro-pipeline stage that requires several clock cycles to compute its outputs. Each macro-pipeline stage can start its computations as soon as valid outputs from the previous pipeline stage become available.
A. Macro-Pipeline Architecture
Let $N_I$ and $N_n$ denote the number of inputs per neuron (which is equal to the number of neurons of the previous layer) and the number of neurons for a given NN layer, respectively. The goal of a macro-pipeline stage is to process each neuron of its corresponding layer by computing the following outputs:
$$y_j = f\left(\sum_{i=1}^{N_I} w_{i,j}\, x_i + b_j\right), \quad j = 1, \dots, N_n, \tag{7}$$
where $x_i$ are the inputs, $w_{i,j}$ are the weights, $b_j$ are the biases, and $f(x)$ is a non-linear activation function ([9] uses a ReLU activation function). The architecture of each macro-pipeline stage is shown in more detail in Fig. 5. More specifically, each macro-pipeline stage contains an input interface, an array of $N_{PE}$ processing elements (PEs), a weights-and-biases memory, a control unit, and an output interface. We note that all weights, biases, and partial sums have a common bit-width of $Q$ bits and saturation is used in case of an overflow.
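The saturation behaviour mentioned above can be sketched in Python as follows. The sketch assumes signed two's-complement values and a fixed-point format with `frac_bits` fractional bits. Neither the number format nor the rounding mode is specified in the text, so both are assumptions, and the function names are hypothetical.

```python
def saturate(value, q_bits):
    """Clamp an integer to the signed two's-complement range of q_bits bits."""
    hi = (1 << (q_bits - 1)) - 1
    lo = -(1 << (q_bits - 1))
    return max(lo, min(hi, value))


def sat_mac(acc, w, x, q_bits, frac_bits):
    """Compute acc + w * x in fixed point and saturate the result to q_bits bits."""
    # Rescale the product; the arithmetic shift rounds toward minus infinity.
    product = (w * x) >> frac_bits
    return saturate(acc + product, q_bits)
```

For example, with `q_bits = 8` any accumulated value above 127 is clamped to 127 instead of wrapping around to a negative number.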
The $N_{PE}$ PEs, whose internal structure is shown in Fig. 6, can be used to compute (7) over multiple clock cycles using one of two possible schedules. In the neuron-by-neuron (NBN) schedule, neurons are processed sequentially and each of the $N_{PE}$ PEs computes a part of the sum in (7) for a given neuron $j$. In the input-by-input (IBI) schedule, on the other hand, the layer inputs $x_i$ are processed sequentially and the $N_{PE}$ PEs update the sum in (7) with the term $w_{i,j} x_i$ for $N_{PE}$ neurons in parallel. When an NBN macro-pipeline stage is followed by an IBI macro-pipeline stage, the IBI stage can already start performing computations once the output of the first neuron of the NBN stage has been computed, thus masking a significant part of the latency and reducing the number of interconnects between the two stages. Since the exact architecture of each macro-pipeline stage depends on the processing schedule, we describe the details of the corresponding architectures separately in the next two sections.
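The difference between the two schedules can be illustrated with a small behavioural model in Python. The sketch below only mimics the order in which the products are accumulated and is not cycle-accurate. It assumes a ReLU activation, covers only the NBN case in which all PEs work on a single neuron, and uses hypothetical names, with `W[i][j]` being the weight from input `i` to neuron `j`.

```python
def relu(v):
    return v if v > 0 else 0


def layer_nbn(x, W, b, n_pe):
    """Neuron-by-neuron: finish one neuron at a time; each PE sums a slice of its inputs."""
    outputs = []
    for j in range(len(b)):
        partial = [0] * n_pe
        for i, xi in enumerate(x):
            partial[i % n_pe] += W[i][j] * xi   # PE (i mod n_pe) handles input i
        outputs.append(relu(sum(partial) + b[j]))
    return outputs


def layer_ibi(x, W, b, n_pe):
    """Input-by-input: consume one input at a time; PEs update different neurons."""
    acc = list(b)                                # one running sum per neuron
    for i, xi in enumerate(x):
        for start in range(0, len(b), n_pe):     # groups of up to n_pe neurons
            for j in range(start, min(start + n_pe, len(b))):
                acc[j] += W[i][j] * xi
    return [relu(a) for a in acc]
```

Both functions return the same layer outputs. What differs is the loop order, which determines when the first outputs become available and how many partial sums each PE has to store.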
B. Neuron-by-Neuron Macro-Pipeline Architecture

1) Input Interface:
The input interface consists of $N_{PE}$ multiplexers, which route each input to the correct PE.
2) Processing Elements: In the NBN schedule, each PE is only associated with a single neuron, meaning that only a single partial sum needs to be stored. Thus, the PEs are simple multiply-and-accumulate (MAC) units and the memory shown in Fig. 6 is in fact a single Q-bit register.
3) Control Unit: The main tasks of the control unit are to distribute the computations to the PEs and to stall the computations when no valid inputs are available or when the following macro-pipeline stage is not ready to accept new outputs. The computations are dispatched to the PEs as follows. When $N_{PE} \le N_I$, all $N_{PE}$ PEs are used to process a single neuron at a time and $N_n \lceil N_I / N_{PE} \rceil$ clock cycles are required to process all neurons. When $N_{PE} > N_I$, we constrain $N_{PE}$ so that $N_{PE} = k N_I$, $k \in \mathbb{N}$, meaning that $k$ neurons are processed in parallel and $\lceil N_n N_I / N_{PE} \rceil$ clock cycles are required to process all neurons.
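The dispatch rule can be turned into a small cycle-count helper. The Python sketch below assumes that partial groups are rounded up to whole clock cycles (i.e., ceiling operations), which is the natural reading of the two cases but is stated here as an assumption. The function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def nbn_cycles(n_neurons, n_inputs, n_pe):
    """Clock cycles for an NBN stage to process all neurons of one layer."""
    if n_pe <= n_inputs:
        # One neuron at a time, ceil(N_I / N_PE) cycles per neuron.
        return n_neurons * ceil_div(n_inputs, n_pe)
    if n_pe % n_inputs != 0:
        raise ValueError("N_PE must be an integer multiple of N_I when N_PE > N_I")
    # k = N_PE / N_I neurons are processed in parallel.
    return ceil_div(n_neurons * n_inputs, n_pe)
```

For instance, `nbn_cycles(18, 26, 52)` corresponds to k = 2 neurons in parallel and returns 9 cycles.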
4) Weights and Biases Memories:
The weights and biases memories are used to store $w_{i,j}$ and $b_j$ and they can be written externally to re-configure the NN. The weights are organized in a memory that is $N_{PE} Q$ bits wide so that all PEs can be provided with data in parallel. A single word of the weights memory contains $N_{PE}$ weight values corresponding to $k$ different neurons. The biases memory, on the other hand, has a bit-width of $kQ$ bits.
5) Output Interface:
The output interface adds the partial sums from the $N_{PE}$ PEs using an adder tree, adds the biases, and applies the non-linear activation function for each of the $k$ neurons that are being processed in parallel. A register is added between the PEs and the output interface in order to reduce the critical path of the architecture. Moreover, the output interface forwards the outputs of the $k$ neurons that are processed in parallel to the next macro-pipeline stage.
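A behavioural sketch of the output interface in Python is given below. It assumes that the PEs are assigned to the k parallel neurons in contiguous groups of equal size and that the activation is a ReLU. The grouping and the function names are illustrative assumptions rather than details taken from the architecture.

```python
def adder_tree(values):
    """Sum a list by pairwise reduction, mirroring the levels of an adder tree."""
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:          # pad odd-sized levels with a zero operand
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0] if level else 0


def output_interface(partial_sums, biases, k):
    """Reduce the PE partial sums of k parallel neurons, add biases, apply ReLU."""
    per_neuron = len(partial_sums) // k   # PEs assigned to each neuron
    outputs = []
    for j in range(k):
        group = partial_sums[j * per_neuron:(j + 1) * per_neuron]
        total = adder_tree(group) + biases[j]
        outputs.append(max(0, total))
    return outputs
```

The pairwise reduction needs about log2 of the group size adder levels, which is why a register between the PEs and the output interface helps to shorten the critical path.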
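clock cycles to produce all outputs of a NN layer. However, one full set of outputs for a NN layer is actually produced every $\lceil N_n N_I / N_{PE} \rceil$ cycles, so that the throughput of the NBN macro-pipeline stage is $1 / \lceil N_n N_I / N_{PE} \rceil$ sets of layer outputs per clock cycle.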
Moreover, the first k outputs of an NBN macro-pipeline stage become available after
clock cycles and after that $k$ new outputs are produced in every clock cycle. This means that a potential IBI macro-pipeline stage that follows can already start its computations after $L_f$ clock cycles and that only $k \le N_n$ outputs need to be forwarded to the next stage in each clock cycle.
C. Input-by-Input Macro-Pipeline Architecture
1) Input & Output Interface:
The input and output interfaces are similar to those of the NBN macro-pipeline stage, the main difference being that the IBI output interface forwards the outputs of all $N_n$ neurons that are processed in parallel to the next macro-pipeline stage.
2) Processing Elements: In the IBI schedule, each PE can be associated with multiple neurons, meaning that several partial sums potentially need to be stored. Thus, the PEs are MAC units and the memory shown in Fig. 6 has a 
respectively. Moreover, all $N_n$ outputs of an IBI macro-pipeline stage become available simultaneously after
D. Overall Neural Network Canceller Architecture
The overall architecture for the two-layer NN of [9] consists of two macro-pipeline stages, one for the hidden layer and one for the output layer, and pipeline registers are added between the macro-pipeline stages. The hidden layer uses an NBN macro-pipeline stage, while the output layer uses an IBI macro-pipeline stage. For the hidden layer, we have $N_I = 2L$ and $N_n = N_h$, while for the output layer we have $N_I = N_h$ and $N_n = 2$. The $N_I = 2L$ inputs of the first macro-pipeline stage that implements the computations of the hidden layer are assumed to all be available in parallel. The number of PEs instantiated for the hidden layer and the output layer is $N_{PE,h}$ and $N_{PE,o}$, respectively. The computations for the linear canceller are done in parallel with the NN by instantiating a standard complex FIR filter. If we denote the throughput of the hidden and the output macro-pipeline stages by $T_h$ and $T_o$, respectively, then the throughput of the two-layer NN architecture is $\min(T_h, T_o)$, i.e., it is limited by the slower of the two stages.
Finally, we note that we constrain the denormalization step shown in Fig. 3 to scaling with powers of two, which can be implemented efficiently with simple shifting operations, both during training and during inference.
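Scaling by a power of two reduces to a shift on fixed-point integers, as the following Python sketch shows. The function name is hypothetical, and the rounding behaviour for right shifts (toward minus infinity) is a property of this sketch rather than a documented property of the hardware.

```python
def denormalize_pow2(value, shift):
    """Scale a fixed-point integer by 2**shift using only shift operations."""
    if shift >= 0:
        return value << shift
    return value >> -shift   # arithmetic right shift, rounds toward minus infinity
```

Because no multiplier is needed, the denormalization adds essentially no arithmetic cost to the datapath.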
IV. FPGA AND ASIC IMPLEMENTATION RESULTS
In this section, we present implementation results for the NN-based canceller and we compare it with a polynomial canceller. Since, to the best of our knowledge, there are no published implementations of polynomial cancellers in the literature, we provide our own reference implementation. Due to space limitations, we do not describe the implementation in detail, but it is largely based on the NN architecture since the main computational task of the polynomial canceller is similar to that of the NN canceller, i.e., to compute a weighted sum. The main differences are that the input interface also computes the basis functions and that the PEs operate directly on complex numbers. Each complex PE of the polynomial canceller is implemented using three real multipliers.
A. Comparison Setup
In order to provide a fair comparison between the NN-based SI canceller and the polynomial canceller, we select $L$, $N_h$, $P$, and the quantization bit-width $Q$ so that the fixed-point performance of the two cancellers is as similar as possible. For performance evaluation, we used the same dataset that was used in [9], which consists of a 10 MHz QPSK-modulated OFDM signal sampled at 20 MHz that is generated using the testbed described in [12] and [13].
For $L = 13$, $P = 7$, and $N_h = 18$, the performance of the two cancellers is very similar, as can be seen in Table I. In Fig. 7, we show the cancellation performance for the NN-based canceller and the polynomial based canceller as a function of $Q$. We observe that, for the same cancellation performance, the NN-based canceller generally requires a lower quantization bit-width $Q$. For the hardware implementation results, we choose $Q = 17$ for the NN-based canceller and $Q = 23$ for the polynomial canceller so that the two cancellers have the same fixed-point cancellation performance.
We set $N_{PE,h} = 52$ and $N_{PE,o} = 4$ for the NN-based canceller so that $T_h = T_o = 1/9$, meaning that the macro-pipeline is perfectly balanced and one cancellation sample is output every 9 clock cycles. Furthermore, 2 complex PEs are instantiated for the NN-based canceller in order to perform the linear cancellation step within the same time. For the polynomial canceller, we use $N_{PE,h} = 20$ complex PEs so that the 260 complex multiplications required to compute (1) for $L = 13$ and $P = 7$ can be carried out in 13 clock cycles, which means that one cancellation sample is output every 13 clock cycles.
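The cycle counts of this design point can be checked with a few lines of Python. The sketch assumes that each stage needs the total number of multiply-accumulate operations divided by the number of PEs, rounded up, and that the polynomial model uses p + 1 basis functions for every odd order p up to P. The expression used for the IBI stage in particular is an assumption, and the variable names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


L, P, N_h = 13, 7, 18

# Hidden layer (NBN): N_I = 2L inputs, N_n = N_h neurons, 52 PEs.
hidden_cycles = ceil_div(N_h * 2 * L, 52)
# Output layer (IBI): N_I = N_h inputs, N_n = 2 neurons, 4 PEs.
output_cycles = ceil_div(2 * N_h, 4)
# The slower stage sets the number of cycles per cancellation sample.
nn_cycles_per_sample = max(hidden_cycles, output_cycles)

# Linear canceller: L complex taps shared by 2 complex PEs.
linear_cycles = ceil_div(L, 2)

# Polynomial canceller: (p + 1) basis functions for each odd p <= P, per tap.
poly_terms = L * sum(p + 1 for p in range(1, P + 1, 2))
poly_cycles_per_sample = ceil_div(poly_terms, 20)
```

Under these assumptions both NN stages take 9 cycles, the linear canceller finishes within that budget, and the polynomial canceller needs 13 cycles for its 260 complex multiplications.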
B. Implementation Results
The placed-and-routed implementation results on a Xilinx Virtex-7 FPGA are given in Table II. We observe that the NN-based canceller has significantly lower resource utilization than the polynomial canceller and a 96% higher throughput. The higher throughput of the NN-based canceller comes both from a lower number of cycles per sample and from a higher operating frequency compared to the polynomial canceller. We also note that the polynomial canceller requires approximately two times more DSP slices than the NN-based canceller. This happens because the DSP slices on Xilinx Virtex-7 FPGAs do not support multiplications between two $Q = 23$-bit values and two DSP slices have to be instantiated for each multiplication in the polynomial canceller.
The fully placed-and-routed ASIC implementation results using a 28 nm FD-SOI technology are shown in Table III. We observe that the NN-based canceller has a 60% better throughput and that it occupies 11% less area than the polynomial canceller, leading to an 81% better hardware efficiency. Similarly to the FPGA results, the better throughput of the NN-based canceller comes both from a lower number of cycles per sample and from a higher operating frequency compared to the polynomial canceller.
V. CONCLUSION
In this paper, we described a high-throughput hardware architecture for a NN-based self-interference cancellation scheme for full-duplex radios. Our implementation results show that the NN-based canceller has a lower computational complexity and that it requires a 22% lower datapath quantization bit-width to achieve the same cancellation performance as a polynomial cancellation scheme. The NN-based canceller thus requires significantly fewer resources on an FPGA and achieves an 81% better hardware efficiency than the polynomial canceller when implemented for an ASIC target.
VI. ACKNOWLEDGMENT
