Hardware Implementation of Neural Self-Interference Cancellation by Kurzo, Yann et al.
Hardware Implementation of Neural
Self-Interference Cancellation
Yann Kurzo, Andreas Toftegaard Kristensen, Andreas Burg, Member, IEEE,
Alexios Balatsoukas-Stimming, Member, IEEE
Abstract—In-band full-duplex systems can transmit and re-
ceive information simultaneously on the same frequency band.
However, due to the strong self-interference caused by the
transmitter to its own receiver, the use of non-linear digital self-
interference cancellation is essential. In this work, we describe
a hardware architecture for a neural network-based non-linear
self-interference (SI) canceller and we compare it with our own
hardware implementation of a conventional polynomial based
SI canceller. In particular, we present implementation results
for a shallow and a deep neural network SI canceller as well
as for a polynomial SI canceller. Our results show that the
deep neural network canceller achieves a hardware efficiency
of up to 312.8 Msamples/s/mm^2 and an energy efficiency of
up to 0.9 nJ/sample, which is 2.1× and 2× better than the
polynomial SI canceller, respectively. These results show that NN-
based methods applied to communications are not only useful
from a performance perspective, but can also be a very effective
means to reduce the implementation complexity.
I. INTRODUCTION
In-band full-duplex (FD) communications have long
been considered to be impractical due to the strong self-
interference (SI) caused by the transmitter to its own receiver.
However, recent work on the topic (e.g., [2], [3], [4]) has
demonstrated that it is, in fact, possible to achieve sufficient
SI cancellation to make FD systems viable. Typically, SI can-
cellation is performed in both the radio frequency (RF) domain
and the digital domain to cancel the SI signal down to the level
of the receiver noise floor. There are several ways to achieve
RF cancellation that can be broadly categorized into passive
RF cancellation and active RF cancellation. Some form of RF
cancellation is generally necessary to avoid saturating the ana-
log front-end of the receiver. Passive RF cancellation can be
obtained by using, e.g., circulators, directional antennas, beam-
forming, polarization, or shielding. Active RF cancellation is
commonly implemented by transforming the transmitted RF
signal appropriately to emulate the SI channel using analog
components and subtracting the resulting SI cancellation signal
from the received SI signal [2], [4]. Alternatively, an additional
transmitter can be used to generate the SI cancellation signal
from the transmitted baseband samples [3].
Y. Kurzo is with ON Semiconductor, 2074 Marin-Epagnier, Switzerland
(e-mail: yann.kurzo@gmail.com).
A. Kristensen and A. Burg are with the Telecommunications Circuits Laboratory,
École polytechnique fédérale de Lausanne, 1015 Lausanne, Switzerland
(e-mail: {andreas.kristensen,andreas.burg}@epfl.ch).
A. Balatsoukas-Stimming is with the Eindhoven University
of Technology, 5600 MB Eindhoven, The Netherlands (e-mail:
a.k.balatsoukas.stimming@tue.nl).
Parts of this work were presented at the 2018 Asilomar Conference on
Signals, Systems, and Computers [1].
This work was supported by the Swiss National Science Foundation under
project #200021 182621.
However, a residual SI signal is typically still present at the
receiver after RF cancellation has been performed. This resid-
ual SI signal can, in principle, be easily canceled in the digital
domain, since it is caused by a known transmitted signal.
Unfortunately, in practice, several transceiver non-linearities
distort the SI signal. Some examples of non-linearities in-
clude baseband non-linearities (e.g., digital-to-analog con-
verter (DAC) and analog-to-digital converter (ADC)) [5], IQ
imbalance [5], [6], phase-noise [7], [8], and power amplifier
(PA) non-linearities [5], [6], [9], [10]. These effects need to be
taken into account using intricate polynomial models to cancel
the SI to the level of the receiver noise floor. These polynomial
models perform well in practice, but their implementation
complexity grows rapidly with the maximum considered non-
linearity order. Principal component analysis (PCA) is an
effective complexity reduction technique that can identify the
most significant non-linearity terms in a parallel Hammerstein
model [10]. However, with PCA-based methods, the trans-
mitted digital baseband samples need to be multiplied with
a transformation matrix to generate the SI cancellation signal,
thus introducing additional complexity. Moreover, whenever
the SI channel changes, the high-complexity PCA operation
needs to be re-run. To the best of our knowledge, no hardware
implementation of a polynomial SI canceller has been reported
in the open literature to date. Only the work of [11] has made a
step in this direction, since the authors considered quantization
aspects of polynomial SI cancellers.
In the past few years, there has been renewed interest in the
use of neural networks (NNs) to augment or replace a range
of signal processing tasks in communications systems [12],
[13], [14], [15], [16], [17]. NNs are particularly well-suited to
tackle non-linear signal processing problems, where traditional
model-based algorithms are unavailable or too complex for
analytical treatment. However, NN-based solutions can also
be used in cases where traditional model-based algorithms
suffer from prohibitively high implementation complexity.
For example, NNs have been used successfully to perform
digital predistortion (DPD) in wireless systems [18], [19], non-
linear leakage cancellation in FDD transceivers [20], as well
as optical fiber non-linearity compensation [21]. NNs have
also been used for non-linear SI cancellation in full-duplex
communications [22], [23], [24] and it was shown in [22]
that they can achieve similar SI cancellation performance with
a state-of-the-art polynomial SI cancellation model, but with
much lower computational complexity.
arXiv:2001.04543v1 [eess.SP] 13 Jan 2020
Existing NN hardware accelerators, such as [25], [26],
mainly target applications where both the size of the NN and
the number of inputs are very large, and where producing a
few tens of outputs per second is sufficient. Communications
applications, on the other hand, use relatively small NNs
with few inputs, but need to provide millions of outputs per
second. As such, communications applications generally re-
quire different and more specialized NN hardware accelerator
architectures. However, to date, only a small number of works
have considered these hardware-related issues in the context of
communications applications. Specifically, the works of [27],
[28] study NN quantization as a first step towards hardware
implementation, while the authors of [1], [29] describe actual
hardware implementations of simple NNs for SI cancellation
in full-duplex communications and DPD, respectively.
Contribution: In this work, we present a hardware imple-
mentation of the SI cancellation method proposed in [22] to
quantify and translate the computational complexity gains over
the state-of-the-art polynomial based model of [10] into real-
world hardware resource utilization gains. Moreover, we also
implement an instance of the deep NN canceller proposed
in [24] which leads to a significant additional complexity
reduction compared with the shallow NN canceller of [22].
Since, to the best of our knowledge, no polynomial SI can-
celler implementations have been reported in the literature, we
also present a hardware architecture for a reference polynomial
SI canceller. We note that this hardware architecture can
also be used for other related applications such as digital
predistortion and leakage cancellation in FDD transceivers.
We provide FPGA and ASIC implementation results that
clearly demonstrate the significant gains with respect to the
polynomial SI canceller that can be achieved by both the
shallow and the deep NN-based SI cancellers in terms of
resource utilization, throughput, and energy efficiency.
Outline: The remainder of this paper is organized as fol-
lows. Section II provides background on full-duplex com-
munications and digital SI cancellation using polynomial
cancellers, while Section III describes how SI cancellation
can be achieved using NNs. In Section IV we describe our
proposed NN-based SI canceller hardware architecture and
in Section V we describe our proposed baseline polynomial-
based SI canceller hardware architecture. In Section VI, we
compare the performance and the complexity of a conventional
polynomial SI canceller with the NN-based SI cancellers. In
Section VI we also provide FPGA and ASIC implementation
results. Finally, Section VII concludes this paper.
II. CONVENTIONAL DIGITAL SELF-INTERFERENCE
CANCELLATION
Fig. 1 shows a block diagram of a full-duplex transceiver.
On the transmitter side, the digital baseband samples x[n] ∈ C,
where n is the sample index, are converted to an analog
signal using a digital-to-analog converter (DAC), up-converted
to a carrier frequency fc using an IQ mixer, amplified using
a power amplifier (PA), and filtered using a bandpass (BP)
filter. The transmitted signal leaks to the receiver through
an SI channel hSI and is then filtered using a BP filter,
amplified using an LNA, downconverted using an IQ mixer,
and digitized using an analog-to-digital-converter (ADC). The
SI channel hSI also models the passive RF SI cancellation.
An RF cancellation signal is subtracted from the received SI
signal at some point before the LNA to avoid saturating the
receiver. Since the transmitter and the receiver are co-located,
they share a common local oscillator (LO) signal in order to
minimize the effect of phase noise on the SI signal.
If we assume, for simplicity of exposition, that there is no
signal-of-interest from a remote node and no thermal noise,
then the received signal y[n] in Fig. 1 consists only of the
residual SI signal after RF SI cancellation has been performed.
We denote the received signal in this special case by ySI[n].
The goal of digital SI cancellation is to reproduce an accurate
copy of ySI[n], denoted by yˆSI[n], based on samples of the
transmitted baseband signal x[n]. This signal is then subtracted
from y[n] so that the residual SI signal is ySI[n] − yˆSI[n]. If
yˆSI[n] is reconstructed perfectly, then the SI can be canceled
entirely and ySI[n] − yˆSI[n] = 0. In practice, as discussed
previously, due to the presence of thermal noise and transceiver
non-linearities, perfect SI cancellation is difficult to achieve.
The SI cancellation performance CdB is typically evaluated as:
$$C_{\mathrm{dB}} = 10 \log_{10} \left( \frac{\sum_n |y_{\mathrm{SI}}[n]|^2}{\sum_n |y_{\mathrm{SI}}[n] - \hat{y}_{\mathrm{SI}}[n]|^2} \right). \tag{1}$$
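As a minimal sketch of how (1) can be evaluated on recorded samples, the following Python function computes the cancellation in dB from two equal-length lists of complex samples. The function name and the toy data are illustrative assumptions and are not part of the original work.

```python
import math


def cancellation_db(y_si, y_si_hat):
    """Evaluate (1): SI power divided by residual SI power, in dB."""
    si_power = sum(abs(y) ** 2 for y in y_si)
    residual_power = sum(abs(y - y_hat) ** 2 for y, y_hat in zip(y_si, y_si_hat))
    # Note: a perfect reconstruction gives residual_power == 0 and is not handled here.
    return 10.0 * math.log10(si_power / residual_power)


# Hypothetical toy data: the reconstruction is 10% too small in amplitude,
# so the residual power is 1% of the SI power.
y_si = [1 + 1j, -2 + 0.5j, 0.3 - 1j]
y_si_hat = [0.9 * y for y in y_si]
c_db = cancellation_db(y_si, y_si_hat)
```

With a 10% amplitude error the residual power is 100 times smaller than the SI power, so `c_db` is approximately 20 dB.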
A. Linear Self-Interference Cancellation
Linear SI cancellation is the simplest form of SI cancellation
that ignores all non-linear effects of the various components
in Fig. 1. The linear SI cancellation signal is constructed as:
$$\hat{y}_{\mathrm{SI}}[n] = \sum_{l=0}^{L-1} \hat{h}[l]\, x[n-l], \tag{2}$$
where hˆ[l] ∈ C, l ∈ {0, . . . , L − 1}, models the SI channel
hSI and any other memory effect in the transceiver chain. The
parameters hˆ[l] can be obtained from training samples either in
a one-shot fashion using standard least-squares (LS) estimation
or adaptively using an iterative version of the LS estimation
algorithm, such as least mean squares (LMS) or recursive least
squares (RLS).
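The linear canceller in (2) and an adaptive estimate of its taps can be sketched as follows. This is an illustrative, noise-free LMS example in plain Python. The step size, the tap values, the random test signal, and all function names are assumptions chosen for the example, and samples before n = 0 are treated as zero.

```python
import random


def linear_canceller(h_hat, x, n):
    """Evaluate (2): sum over l of h_hat[l] * x[n - l], with x[k] = 0 for k < 0."""
    return sum(h_hat[l] * x[n - l] for l in range(len(h_hat)) if n - l >= 0)


def lms_estimate(x, y, num_taps, step_size):
    """Estimate the taps h_hat[l] with the complex LMS update."""
    h_hat = [0j] * num_taps
    for n in range(num_taps - 1, len(x)):
        error = y[n] - linear_canceller(h_hat, x, n)
        for l in range(num_taps):
            # Gradient step: move each tap along error * conj(input sample).
            h_hat[l] += step_size * error * x[n - l].conjugate()
    return h_hat


# Hypothetical two-tap SI channel and unit-variance complex Gaussian input.
rng = random.Random(0)
h_true = [0.8 - 0.3j, 0.1 + 0.2j]
sigma = 0.5 ** 0.5
x = [complex(rng.gauss(0, sigma), rng.gauss(0, sigma)) for _ in range(2000)]
y = [linear_canceller(h_true, x, n) for n in range(len(x))]
h_hat = lms_estimate(x, y, num_taps=2, step_size=0.05)
```

In this noise-free setting the estimated taps converge to `h_true`. With thermal noise and non-linear distortion present, the estimate would only be approximate, which is the situation discussed in the following subsection.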
B. Polynomial Non-Linear Self-Interference Cancellation
Each active component in the transceiver model shown
in Fig. 1 is generally a dynamic non-linear system. This
means that linear cancellation alone is, in most cases, not
accurate enough to cancel a sufficiently large fraction of the
SI signal. It has been shown that the transmitter IQ imbalance
and the PA non-linearities typically dominate all remaining
non-linearities [9], [10]. This is true in particular when the
transmitter and receiver chains use the same local oscillator
signal for upconversion, as shown in Fig. 1, so that the effect
of phase noise becomes negligible.
Fig. 1. Block diagram of a full-duplex transceiver with active RF SI cancellation and digital SI cancellation. A few components have been omitted for simplicity; a more detailed diagram can be found in [10]. (Blocks shown: DAC, IQ mixer, PA, and BP filter in the transmit chain; SI channel hSI; BP filter, LNA, IQ mixer, and ADC in the receive chain; shared local oscillator; RF cancellation; digital cancellation.)
As such, the SI cancellation
signal yˆSI[n] can be constructed as [9], [10]:
$$\hat{y}_{\mathrm{SI}}[n] = \sum_{\substack{p=1,\\ p\ \mathrm{odd}}}^{P} \sum_{l=0}^{L-1} \hat{h}_p[l] \left( K_1 x[n-l] + K_2 x^*[n-l] \right)^p, \tag{3}$$
where the parameters hˆp[l] ∈ C, l ∈ {0, . . . , L − 1}, p ∈
{1, 3, . . . , P}, model the joint effect of hˆSI[l] and the memory
effects introduced by the PA for the harmonic of order p,
and K1 and K2 are parameters that model the IQ imbalance.
Only odd values for p are considered because even harmonics
typically lie out-of-band and are filtered out by the transmitter
and receiver BP filters. With some arithmetic manipulations
yˆSI[n] can be re-written as [9], [10]:
$$\hat{y}_{\mathrm{SI}}[n] = \sum_{\substack{p=1,\\ p\ \mathrm{odd}}}^{P} \sum_{q=0}^{p} \sum_{l=0}^{L-1} \hat{h}_{p,q}[l]\, x[n-l]^{q}\, x^*[n-l]^{p-q}, \tag{4}$$
where the parameters hˆp,q[l] ∈ C capture the joint effect of
hˆp[l] and of the IQ imbalance parameters K1 and K2. The
model in (4) is linear with respect to the parameters hˆp,q[l],
and therefore, similarly to linear SI estimation, the parameters
hˆp,q[l] can be estimated based on training samples using some
variant of the LS estimation algorithm. The basis functions of
the polynomial model in (4) are defined as:
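$$\mathrm{BF}_{p,q}(x) = x^{q} (x^*)^{p-q}. \tag{5}$$
The number of distinct basis functions in (4) is [10]:
$$N_{\mathrm{BF}} = \frac{L}{4} (P+1)(P+3). \tag{6}$$
Using (5), the expression for yˆSI[n] in (4) can be re-written in
a more compact form:
$$\hat{y}_{\mathrm{SI}}[n] = \sum_{\substack{p=1,\\ p\ \mathrm{odd}}}^{P} \sum_{q=0}^{p} \sum_{l=0}^{L-1} \hat{h}_{p,q}[l]\, \mathrm{BF}_{p,q}\left( x[n-l] \right). \tag{7}$$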
We note that linear cancellation is a special case of the
polynomial model in (7) when only considering the single
term for p = 1 and q = 1.
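The polynomial model in (5)-(7) can be sketched directly in Python. The dictionary-based coefficient storage, the function names, and the toy values below are assumptions made for illustration, and samples before n = 0 are treated as zero.

```python
def basis_function(x, p, q):
    """BF_{p,q}(x) = x^q * conj(x)^(p - q), as in (5)."""
    return x ** q * x.conjugate() ** (p - q)


def num_basis_functions(num_taps, max_order):
    """N_BF = L (P + 1)(P + 3) / 4, as in (6), for odd max_order P."""
    return num_taps * (max_order + 1) * (max_order + 3) // 4


def polynomial_canceller(h_hat, x, n, num_taps, max_order):
    """Evaluate (7); h_hat maps (p, q, l) to a complex coefficient."""
    total = 0j
    for p in range(1, max_order + 1, 2):  # odd orders only
        for q in range(p + 1):
            for l in range(num_taps):
                if n - l >= 0:
                    total += h_hat[(p, q, l)] * basis_function(x[n - l], p, q)
    return total


# Hypothetical example with L = 2 taps and maximum order P = 3.
num_taps, max_order = 2, 3
h_hat = {(p, q, l): 0j
         for p in range(1, max_order + 1, 2)
         for q in range(p + 1)
         for l in range(num_taps)}
# Keeping only the p = 1, q = 1 terms reduces (7) to the linear canceller (2).
h_hat[(1, 1, 0)] = 0.5 + 0.5j
h_hat[(1, 1, 1)] = -0.25j
x = [1 + 2j, -1 + 1j, 0.5 - 0.5j]
y_hat = polynomial_canceller(h_hat, x, 2, num_taps, max_order)
```

The number of dictionary entries equals the value returned by `num_basis_functions`, which gives a quick consistency check of (6) against the triple sum in (7).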
C. Computational Complexity
Assuming that each complex multiplication is implemented
using three real-valued multiplications and five real-valued
additions and that each complex addition is implemented using
two real-valued additions, it can directly be deduced from
(2) that the total number of real-valued multiplications and
additions that are required by the linear SI canceller is:
NADD,lin = 7L− 2, (8)
NMUL,lin = 3L. (9)
Moreover, if we ignore the computation of the basis functions
for simplicity,1 the total number of real-valued multiplications
and additions that are required by the polynomial SI canceller
(which also includes the linear cancellation term) is [22]:
$$N_{\mathrm{ADD,poly}} = \frac{7}{4} L (P+1)(P+3) - 2, \tag{10}$$
$$N_{\mathrm{MUL,poly}} = \frac{3}{4} L (P+1)(P+3). \tag{11}$$
We note that the expression for NADD,poly in our previous work
of [22] erroneously ignored the five real-valued additions that
are required to implement each complex multiplication. As
such, the actual complexity of the polynomial canceller is even
higher than that reported in [22].
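The operation counts in (8)-(11) follow from counting 3 real multiplications and 5 real additions per complex multiplication, plus 2 real additions for each of the complex additions that combine the terms. The sketch below evaluates them. The function names and the parameter values L = 13 and P = 7 are illustrative assumptions and not results from this work.

```python
def linear_complexity(num_taps):
    """Real additions and multiplications of the linear canceller, (8)-(9)."""
    return {"add": 7 * num_taps - 2, "mul": 3 * num_taps}


def polynomial_complexity(num_taps, max_order):
    """Real additions and multiplications of the polynomial canceller, (10)-(11)."""
    # N_BF complex multiplications and N_BF - 1 complex additions.
    n_bf = num_taps * (max_order + 1) * (max_order + 3) // 4
    return {"add": 7 * n_bf - 2, "mul": 3 * n_bf}


# Hypothetical parameters, chosen only to illustrate the growth with P.
lin = linear_complexity(13)
poly = polynomial_complexity(13, 7)
```

For these example values the polynomial canceller needs 1818 real additions and 780 real multiplications per sample, compared with 89 and 39 for the linear canceller.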
III. NEURAL NETWORK NON-LINEAR DIGITAL
SELF-INTERFERENCE CANCELLATION
Polynomial SI cancellation models such as (7) work well in
practice but are often highly redundant in the sense that many
of the hˆp,q[l] parameters are very close to zero. NN-based SI
cancellers, on the other hand, can extract the essence of the
non-linear structure of the SI signal from training data, which
often significantly reduces the complexity of the SI cancella-
tion model [22]. A challenge when using NN cancellers is that
the NN training process is inherently noisy due to the use of
mini-batches for gradient estimation, which makes it difficult
to achieve a very accurate reconstruction of the SI signal [24].
To overcome this problem, [22] used a NN to reconstruct only
a particular part of the SI signal, while using conventional
linear cancellation for the remainder of the SI. Specifically, in
[22] the SI signal was conceptually decomposed into a linear
component and a non-linear component:
$$y_{\mathrm{SI}}[n] = y_{\mathrm{SI,linear}}[n] + y_{\mathrm{SI,nl}}[n]. \tag{12}$$
1 We note that this simplification (ignoring the computation of the basis functions in Section II-C) is justified in Section V.
The SI cancellation is carried out in two steps. First, linear
cancellation is used in order to reconstruct yˆSI,linear[n] as:
$$\hat{y}_{\mathrm{SI,linear}}[n] = \sum_{l=0}^{L-1} \hat{h}[l]\, x[n-l]. \tag{13}$$
The parameters hˆ[l] are obtained using LS estimation while
considering the substantially weaker signal yˆSI,nl[n] as noise.
The linear SI cancellation signal is then subtracted from the
SI signal in order to obtain:
ySI,nl[n] ≈ ySI[n]− yˆSI,linear[n]. (14)
The task of the NN is limited to reconstructing ySI,nl[n] based
on the appropriate x[n] samples.
As is common practice when training NNs, we normalize
the input and output training samples so that x[n] and ySI,nl[n]
have unit variance (i.e., the variance of the real part and the
variance of the imaginary part are both equal to 0.5) and zero
mean. To perform SI cancellation on the test data, the output of
the NN is denormalized using the mean and variance estimated
based on the training data.
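The normalization and denormalization steps described above can be sketched as follows. The sketch treats each complex sample as one random variable, so a total variance of one corresponds to 0.5 per real and imaginary part only when the two parts are balanced. The function names and the toy samples are assumptions for illustration.

```python
def complex_stats(samples):
    """Return the mean and the (total) variance of a list of complex samples."""
    mean = sum(samples) / len(samples)
    variance = sum(abs(s - mean) ** 2 for s in samples) / len(samples)
    return mean, variance


def normalize(samples, mean, variance):
    """Shift to zero mean and scale to unit variance."""
    scale = variance ** 0.5
    return [(s - mean) / scale for s in samples]


def denormalize(samples, mean, variance):
    """Undo normalize() using statistics estimated on the training data."""
    scale = variance ** 0.5
    return [s * scale + mean for s in samples]


# Hypothetical training samples of the non-linear SI component.
train = [2 + 1j, -1 + 3j, 0.5 - 2j, 4 + 0j]
mean, variance = complex_stats(train)
normalized = normalize(train, mean, variance)
restored = denormalize(normalized, mean, variance)
```

At test time the same `mean` and `variance` from the training set are reused, so that the NN output is mapped back to the scale of the received signal.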
A. Neural Network Structure
Due to the universal approximation theorem [30], a feed-
forward NN with one hidden layer, as depicted in Fig. 2,
is sufficient to reconstruct the non-linear SI signal. While
the work of [22] only considered feedforward NNs with one
hidden layer, it is possible to use any NN architecture to gen-
erate yˆSI,nl[n]. In particular, [24] employed a deep feedforward
NN and showed that using many layers with few neurons
per layer has significant computational complexity advantages
with respect to a shallow NN SI canceller that uses a single
layer with more neurons. In all cases and as shown in Fig. 2,
the cancellation NNs have 2L input nodes, which correspond
to the real and imaginary parts of the L delayed versions of
x[n], and two output nodes, which correspond to the real and
imaginary parts of the target yˆSI,nl[n] sample. In the following,
we denote the number of hidden layers by Nl and the number
of hidden nodes per layer Nh. Let the vector l0 contain the
2L inputs to the NN:
$$\mathbf{l}_0 = \left[ \Re\{x[n]\}\ \ \Im\{x[n]\}\ \ \dots\ \ \Re\{x[n-L+1]\}\ \ \Im\{x[n-L+1]\} \right]^T. \tag{15}$$
The outputs of the first hidden layer neurons are given by:
$$\mathbf{l}_1 = f_1\left( \mathbf{W}_1 \mathbf{l}_0 + \mathbf{b}_1 \right), \tag{16}$$
where W1 is an Nh × 2L matrix containing the hidden
layer weights, b1 is an Nh × 1 vector containing the hidden
layer biases, and f1(·) is the (vectorized) non-linear activation
function used in the first hidden layer.
Fig. 2. Example of a neural network with one hidden layer for the reconstruction of the non-linear component ySI,nl[n] of the SI signal [22]. (Inputs: real and imaginary parts of x[n], x[n−1], ..., x[n−L+1]; outputs: real and imaginary parts of yˆSI,nl[n].)
The outputs of the
neurons in the hidden layers 1 < l ≤ Nl are:
$$\mathbf{l}_l = f_l\left( \mathbf{W}_l \mathbf{l}_{l-1} + \mathbf{b}_l \right), \tag{17}$$
where Wl is an Nh × Nh matrix containing the hidden
layer weights, bl is an Nh × 1 vector containing the hidden
layer biases, and fl(·) is the (vectorized) non-linear activation
function used in hidden layer l. Finally, the outputs of the
output layer neurons are given by:
$$\mathbf{l}_{N_l+1} = f_{N_l+1}\left( \mathbf{W}_{N_l+1} \mathbf{l}_{N_l} + \mathbf{b}_{N_l+1} \right), \tag{18}$$
where WNl+1 is a 2×Nh matrix containing the output layer
weights, bNl+1 is a 2 × 1 vector containing the output layer
biases, and fNl+1 is the activation function used in the output
layer. As can be seen in Fig. 2, for lNl+1 we have:
$$\mathbf{l}_{N_l+1} = \left[ \Re\{\hat{y}_{\mathrm{SI,nl}}[n]\}\ \ \Im\{\hat{y}_{\mathrm{SI,nl}}[n]\} \right]^T. \tag{19}$$
The goal of the NN is to minimize the mean squared error
between the expected NN output and the actual NN output:
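$$\mathrm{MSE} = \frac{1}{N} \sum_{n=0}^{N-1} \left( \Re\{y_{\mathrm{SI,nl}}[n]\} - \Re\{\hat{y}_{\mathrm{SI,nl}}[n]\} \right)^2 + \frac{1}{N} \sum_{n=0}^{N-1} \left( \Im\{y_{\mathrm{SI,nl}}[n]\} - \Im\{\hat{y}_{\mathrm{SI,nl}}[n]\} \right)^2, \tag{20}$$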
where N is the total number of training samples. The MSE
in (20) is minimized by choosing appropriate values for
Wl, bl, l ∈ {1, . . . , Nl + 1}, using back-propagation [31].
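The forward pass in (15)-(19) can be sketched in plain Python as follows. The sketch assumes ReLU activations in the hidden layers and a linear output layer, which is the configuration considered in Section III-B. The list-of-rows weight layout (one row per neuron), the function names, and the toy weights are assumptions for illustration. Training via back-propagation is not shown.

```python
def nn_input(x, n, num_taps):
    """Build l_0 as in (15): interleaved real and imaginary parts of x[n - l]."""
    vec = []
    for l in range(num_taps):
        sample = x[n - l] if n - l >= 0 else 0j  # samples before n = 0 taken as zero
        vec.extend([sample.real, sample.imag])
    return vec


def affine(weights, biases, vec):
    """Compute W v + b, with one row of `weights` per neuron."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, biases)]


def relu(vec):
    return [max(0.0, v) for v in vec]


def nn_canceller(layers, x, n, num_taps):
    """Evaluate (16)-(19); `layers` is a list of (weights, biases) pairs."""
    vec = nn_input(x, n, num_taps)
    for weights, biases in layers[:-1]:      # hidden layers, (16)-(17)
        vec = relu(affine(weights, biases, vec))
    weights, biases = layers[-1]             # linear output layer, (18)
    real_part, imag_part = affine(weights, biases, vec)
    return complex(real_part, imag_part)     # (19)


# Hypothetical toy network: L = 1, one hidden layer with N_h = 2 neurons.
layers = [
    ([[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0]),   # hidden layer
    ([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.0]),    # output layer
]
y_nl_hat = nn_canceller(layers, [2 - 3j], 0, num_taps=1)
```

For the toy weights above, the input vector is [2, -3], the hidden layer outputs [2, 3], and the network output is 2.5 + 3j.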
B. Computational Complexity
Let us assume that the NN uses the popular ReLU activation
function in the hidden layers, i.e., fl(x) = ReLU(x) = max(0, x),
l ∈ {1, . . . , Nl}, which has similar complexity to a real-valued
addition, and a linear activation function in the output layer
(i.e., fNl+1(x) = x). Then, the number of real-valued
multiplications and additions that are required by a NN
canceller with a single hidden layer and Nh hidden neurons
is [22]:
$$N_{\mathrm{ADD,NN}} = (2L + 3) N_h + 7L, \tag{21}$$
$$N_{\mathrm{MUL,NN}} = (2L + 2) N_h + 3L, \tag{22}$$
where the second term in both expressions comes from the
linear SI canceller that is required for the NN SI canceller to
work. Moreover, two additions are required to add the output
of the linear SI canceller to the output of the NN canceller.2
Fig. 3. High-level architecture of the NN-based SI cancellation scheme [1]. (Labels in the diagram: inputs x(n), x(n−1), ..., x(n−L+1), split into real and imaginary parts; linear approximator ĥ; neural network with weights wi,j and biases bj; denormalization; NN output ŷnn(n); cancellation signal ŷ(n); received signal y(n); output yc(n).)
For the more general NN described in [24] with Nl hidden
layers with Nh neurons each, (21)-(22) can be generalized to:
NADD,NN = (2L+ 3 + (Nl−1)(Nh+1))Nh + 7L, (23)
NMUL,NN = (2L+ 2 + (Nl−1)Nh)Nh + 3L. (24)
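The counts in (21)-(24) can be evaluated with a small helper. Setting the number of hidden layers to one recovers (21)-(22). The function name and the parameter values used below are illustrative assumptions and not configurations reported in this work.

```python
def nn_complexity(num_taps, hidden_neurons, hidden_layers=1):
    """Real additions and multiplications of the NN canceller, (23)-(24)."""
    two_l = 2 * num_taps
    add = (two_l + 3 + (hidden_layers - 1) * (hidden_neurons + 1)) * hidden_neurons
    mul = (two_l + 2 + (hidden_layers - 1) * hidden_neurons) * hidden_neurons
    # The 7L and 3L terms account for the linear canceller running in parallel.
    return {"add": add + 7 * num_taps, "mul": mul + 3 * num_taps}


# Hypothetical configurations with L = 13 taps.
shallow = nn_complexity(13, hidden_neurons=17, hidden_layers=1)
deep = nn_complexity(13, hidden_neurons=10, hidden_layers=2)
```

For these example values, the shallow network needs 584 additions and 515 multiplications, while the deeper network with fewer neurons per layer needs 491 and 419.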
IV. NEURAL NETWORK CANCELLER HARDWARE
ARCHITECTURE
In this section, we describe a generic hardware architecture
that can be used to implement both the shallow NN-based
SI canceller of [22] and deeper NN-based SI cancellers such
as the ones described in [24]. We first provide an overview
of the architecture, which is followed by a more detailed
explanation of each component. In Fig. 3, we show the high-
level architecture of a general NN-based canceller. The set
of baseband samples {x[n], . . . , x[n− L+ 1]} is given as an
input to a linear SI canceller and a NN-based SI canceller.
These two SI cancellers operate in parallel in order to generate
the linear and non-linear cancellation signals, respectively,
which are then added (after the denormalization step for the
NN) to produce the cancellation signal yˆSI[n].
A. Macro-Pipeline Architecture
As shown in the example of Fig. 4, in our architecture,
the canceller NN layers are mapped to macro-pipeline stages.
Each macro-pipeline stage requires several clock cycles to
compute its outputs and it can start its computations as soon as
valid outputs from the previous macro-pipeline stage become
available. Due to the high throughput requirements of the SI
cancellation task, we instantiate one macro-pipeline stage for
each layer in the NN that is used for cancellation.
Let NEl denote the number of neurons in layer l. We note
that NE0 = 2L, NENl+1 = 2, and NEl = Nh for all hidden
layers l ∈ {1, . . . , Nl}. The goal of a macro-pipeline stage
is to compute ll using expressions of the form (16)-(18). Each
element j ∈ {0, . . . , NEl − 1} of ll can be computed as:
$$\mathbf{l}_l[j] = f_l\left( \mathbf{b}_l[j] + \sum_{i=0}^{\mathrm{NE}_{l-1}-1} \mathbf{W}_l[i,j]\, \mathbf{l}_{l-1}[i] \right). \tag{25}$$
2 We note that these two additions were not included in [22], but we include
them here for the sake of accuracy.
The architecture of each macro-pipeline stage is shown in
more detail in Fig. 5. More specifically, each macro-pipeline
stage contains an input interface, an array of NPE processing
elements (PEs), a weights-and-biases memory, a control unit,
and an output interface. We note that for simplicity, all
weights, biases, and partial sums have a common bit-width
of Q bits and saturation is used in case of an overflow. More
sophisticated quantization schemes are possible, but they are
beyond the scope of this work.
The NPE PEs, whose internal structure is shown in Fig. 6,
can be used to compute (25) over multiple clock cycles using
one of two possible schedules. In the neuron-by-neuron (NBN)
schedule, neurons are processed sequentially and each of the
NPE PEs computes a part of the sum in (25) for a given neuron
j. In the input-by-input (IBI) schedule, the inputs of layer l
(i.e., ll−1) are processed sequentially and the NPE PEs update
the sum in (25) with the term W[i, j]ll−1[i] for NPE distinct
neurons in parallel. As an NBN macro-pipeline stage generates
neuron output values sequentially, the optimal accelerator
structure consists of an NBN macro-pipeline stage always
being followed by an IBI macro-pipeline stage, allowing the
IBI stage to start performing computations once the output
of the first neuron of the preceding NBN stage has been
computed. Once all inputs have been processed by the IBI
stage, it immediately outputs multiple values to the NBN stage
which follows it. Having an NBN stage after another NBN
stage means that the second NBN stage would have to wait
for all outputs of the previous stage to be generated before any
processing can take place, and having an IBI stage followed by
another IBI stage would mean that the second IBI stage cannot
start processing before the first IBI stage has processed all its
inputs. This structure of NBN and IBI stages connected in an
alternating fashion masks a significant part of the latency and
reduces the number of interconnects between two consecutive
macro-pipeline stages. Since the exact architecture of each
macro-pipeline stage depends on the processing schedule,
we describe the details of the corresponding architectures
separately in the next two sections.
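The two schedules can be illustrated with a small behavioural model that evaluates (25) for one layer and counts the clock cycles spent on multiply-accumulate operations. The sketch covers only the cases where the number of PEs does not exceed the number of inputs (NBN) or neurons (IBI). It assumes ReLU activations, indexes `weights[j][i]` as the weight from input i to neuron j, and ignores the extra register cycle of the output interface. All names and values are assumptions for illustration and this is not the hardware description itself.

```python
def nbn_layer(weights, biases, inputs, num_pe):
    """Neuron-by-neuron: all PEs share the sum in (25) for one neuron at a time."""
    outputs, cycles = [], 0
    for j in range(len(weights)):
        partial = [0.0] * num_pe  # one accumulator register per PE
        for start in range(0, len(inputs), num_pe):
            stop = min(start + num_pe, len(inputs))
            for pe, i in enumerate(range(start, stop)):
                partial[pe] += weights[j][i] * inputs[i]
            cycles += 1
        # Output interface: adder tree, bias addition, activation.
        outputs.append(max(0.0, sum(partial) + biases[j]))
    return outputs, cycles


def ibi_layer(weights, biases, inputs, num_pe):
    """Input-by-input: each input updates the sums of num_pe neurons per cycle."""
    sums = [0.0] * len(weights)  # partial sums held in the PE memories
    cycles = 0
    for i in range(len(inputs)):
        for start in range(0, len(weights), num_pe):
            for j in range(start, min(start + num_pe, len(weights))):
                sums[j] += weights[j][i] * inputs[i]
            cycles += 1
    outputs = [max(0.0, s + b) for s, b in zip(sums, biases)]
    return outputs, cycles


# Hypothetical layer with 4 inputs and 3 neurons, processed with 2 PEs.
weights = [[1.0, -2.0, 3.0, 0.0], [0.0, 1.0, 1.0, 1.0], [-1.0, -1.0, -1.0, -1.0]]
biases = [0.5, -1.0, 2.0]
inputs = [1.0, 2.0, 3.0, 4.0]
nbn_out, nbn_cycles = nbn_layer(weights, biases, inputs, num_pe=2)
ibi_out, ibi_cycles = ibi_layer(weights, biases, inputs, num_pe=2)
```

Both schedules produce the same layer outputs. The NBN model emits one neuron output after every group of cycles, whereas the IBI model can consume inputs one at a time but only emits its outputs once all inputs have been processed, which is why alternating the two schedules hides part of the latency.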
B. Neuron-by-Neuron Macro-Pipeline Architecture
1) Input Interface: The input interface consists of NPE
multiplexers, which route each of the NEl−1 elements of ll−1
to the correct PE.
2) Processing Elements: In the NBN schedule, each PE
is only associated with a single neuron, and therefore only a
single partial sum needs to be stored in each PE. Thus, the
PEs are simple multiply-and-accumulate (MAC) units and the
memory shown in Fig. 6 is, in fact, a single Q-bit register.
3) Control Unit: The main tasks of the control unit are
to distribute the computations to the PEs and to stall the
computations when no valid inputs are available or when the
following macro-pipeline stage is not ready to accept new
inputs. The computations are dispatched to the PEs as follows.
When NPE ≤ NEl−1, all NPE PEs are used to process a single
neuron at a time and $\mathrm{NE}_l \lceil \mathrm{NE}_{l-1} / N_{\mathrm{PE}} \rceil$ clock cycles are required to
process all neurons. When NPE > NEl−1, we constrain NPE
so that NPE = k · NEl−1, k ∈ N, and hence k neurons are
processed in parallel and $\lceil \mathrm{NE}_l \mathrm{NE}_{l-1} / N_{\mathrm{PE}} \rceil$ clock cycles are required
to process all neurons.
Fig. 4. Example of a macro-pipeline architecture with two stages for a neural network with Nl = 1 hidden layer [1]. More macro-pipeline stages can be added to the pipeline to implement neural networks of arbitrary depth Nl. (Stages shown: hidden layer and output layer; signals shown: din, dout, valid_prev_i, valid_o, stall_o, stall_next_i, clk, rst.)
Fig. 5. Block diagram of the macro-pipeline stage architecture [1]. (Blocks shown: input interface; NPE PEs with internal memory; output interface with tree adder and activation function; weights and biases memory with external memory access; control unit with counters, memory signal generation, and pipeline control.)
4) Weight and Bias Memories: The weight and bias mem-
ories for layer l are used to store Wl and bl and they can
be written externally to re-configure the NN. The weights are
organized in a memory that is NPEQ bits wide so that all PEs
can be provided with data in parallel. A single word of the
weight memory contains NPE weight values corresponding to
k different neurons. The bias memory, on the other hand, has
a bit-width of kQ bits.
5) Output Interface: The output interface adds the partial
sums from the NPE PEs using an adder tree, it adds the
corresponding biases, and it applies the non-linear activation
function fl for each of the k neurons that are being processed
in parallel. A register is added between the PEs and the output
interface in order to reduce the critical path of the architecture.
Moreover, the output interface forwards the outputs of the
k neurons that are processed in parallel to the next macro-
pipeline stage.
6) Latency: In the remainder of this work, we select NPE
carefully so that both $\mathrm{NE}_{l-1} / N_{\mathrm{PE}}$ and $\mathrm{NE}_l \mathrm{NE}_{l-1} / N_{\mathrm{PE}}$ are always integers. With this setting,
an NBN macro-pipeline stage requires
$$L_l = \frac{\mathrm{NE}_l \mathrm{NE}_{l-1}}{N_{\mathrm{PE}}} + 1 \tag{26}$$
clock cycles to produce all outputs of NN layer l. However,
one full set of outputs for a NN layer is actually produced
every $\mathrm{NE}_l \mathrm{NE}_{l-1} / N_{\mathrm{PE}}$ cycles, so that the throughput of the NBN
macro-pipeline stage, measured in samples per clock cycle,
is
$$T_l = \frac{N_{\mathrm{PE}}}{\mathrm{NE}_l \mathrm{NE}_{l-1}}. \tag{27}$$
Moreover, the first k outputs of an NBN macro-pipeline stage
become available after
$$L_{l,\mathrm{first}} = \left\lfloor \frac{\mathrm{NE}_{l-1}}{N_{\mathrm{PE}}} \right\rfloor + 1 \tag{28}$$
clock cycles. Therefore, a potential IBI macro-pipeline stage
that follows can already start its computations after Ll,first
clock cycles, and only k ≤ NEl outputs need to be
forwarded to the next stage at a time.
Fig. 6. Detailed view of the PE architecture that is used by both the NBN and the IBI macro-pipeline stages [1]. (Signals shown: din, weight_i, en_i, initSum_i, dout; blocks shown: partial sum memory and memory interface.)
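The timing expressions (26)-(28) for an NBN stage can be evaluated with a short helper. The function name and the example layer sizes are assumptions for illustration, and the divisibility conditions stated in the text are checked explicitly.

```python
from fractions import Fraction


def nbn_stage_timing(num_inputs, num_neurons, num_pe):
    """Return (latency, throughput, first-output latency) as in (26)-(28)."""
    if num_inputs % num_pe or (num_inputs * num_neurons) % num_pe:
        raise ValueError("num_pe must divide num_inputs and num_inputs * num_neurons")
    latency = num_neurons * num_inputs // num_pe + 1           # (26), clock cycles
    throughput = Fraction(num_pe, num_neurons * num_inputs)    # (27), samples per cycle
    first_output = num_inputs // num_pe + 1                    # (28), clock cycles
    return latency, throughput, first_output


# Hypothetical stage: 26 inputs (L = 13), 10 neurons, 13 PEs.
latency, throughput, first_output = nbn_stage_timing(26, 10, 13)
```

For this example the stage needs 21 cycles for all outputs, produces one sample every 20 cycles, and makes its first output available after 3 cycles, so a following IBI stage can start long before the NBN stage has finished.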
C. Input-by-Input Macro-Pipeline Architecture
1) Input & Output Interfaces: The input and output inter-
faces of the IBI macro-pipeline stage are similar to that of the
NBN macro-pipeline stage. The main difference is that the IBI
output interface forwards the outputs of all NEl neurons that
are processed in parallel to the next macro-pipeline stage.
2) Processing Elements: In the IBI schedule, each PE can
be associated with multiple neurons. Therefore, several partial
sums potentially need to be stored in each PE. Thus, the PEs
are MAC units and the memory shown in Fig. 6 has a size of
$\lceil \mathrm{NE}_l / N_{\mathrm{PE}} \rceil Q$ bits.
3) Control Unit: In the IBI schedule, when NPE ≤ NEl,
all NPE PEs are used to update the NEl neurons of layer l
sequentially with a new input value $\mathbf{l}_{l-1}[i]$ and $\mathrm{NE}_{l-1} \lceil \mathrm{NE}_l / N_{\mathrm{PE}} \rceil$ clock
cycles are required to process all neurons. When NPE > NEl,
we constrain NPE so that NPE = kNEl, k ∈ N, and k inputs
are processed in parallel. Hence, $\lceil \mathrm{NE}_l \mathrm{NE}_{l-1} / N_{\mathrm{PE}} \rceil$ clock cycles are
required to process all neurons.
4) Weight and Bias Memories: The weight and bias mem-
ories are similar to those of the NBN macro-pipeline stage.
A single word of the weight memory contains NPE weights
corresponding to k different neurons. The bias memory has a
bit-width of NElQ bits in the IBI macro-pipeline stage.
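5) Latency: Similarly to the NBN schedule, we choose NPE
carefully so that both $\mathrm{NE}_l / N_{\mathrm{PE}}$ and $\mathrm{NE}_l \mathrm{NE}_{l-1} / N_{\mathrm{PE}}$ are always integers.
Then, the latency and the throughput are
$$L_l = \frac{\mathrm{NE}_l \mathrm{NE}_{l-1}}{N_{\mathrm{PE}}} + 1 \tag{29}$$
clock cycles and
$$T_l = \frac{N_{\mathrm{PE}}}{\mathrm{NE}_l \mathrm{NE}_{l-1}} \tag{30}$$
samples per clock cycle, respectively. Moreover, all NEl
outputs of an IBI macro-pipeline stage become available
simultaneously after
$$L_{l,\mathrm{first}} = \frac{\mathrm{NE}_l \mathrm{NE}_{l-1}}{N_{\mathrm{PE}}} + 1 \tag{31}$$
clock cycles.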
D. Overall Neural Network Canceller Architecture
The overall NN architecture consists of Nl + 1 macro-pipeline
stages with pipeline registers added between them. The first
hidden layer uses an NBN macro-pipeline stage and the second
hidden layer (or the output layer when Nl = 1) uses an
IBI macro-pipeline stage. Further layers use NBN and IBI
macro-pipeline stages in an alternating fashion as previously
discussed. The NE0 = 2L inputs l0 of the first NBN macro-
pipeline stage that implements the computations of the first
hidden layer are assumed to all be available in parallel. The
number of PEs instantiated for layer l is denoted by NPE,l.
The computations for the linear canceller are done in parallel
with the NN by instantiating a standard complex FIR filter
with NPE,linear complex-valued PEs. The latency of the linear
canceller in clock cycles is
$$L_{\mathrm{linear}} = \left\lceil \frac{L}{N_{\mathrm{PE,linear}}} \right\rceil. \tag{32}$$
Since the linear canceller is not pipelined, it holds that
$T_{\mathrm{linear}} = 1 / L_{\mathrm{linear}}$. The throughput of the overall NN canceller
architecture is:
$$T = \min\left( \min_{l \in \{1,\dots,N_l+1\}} T_l,\ T_{\mathrm{linear}} \right). \tag{33}$$
Since it is typically not very costly in terms of resources to
ensure that Tlinear ≥ Tl, l ∈ {1, . . . , Nl+1}, in practice T
is usually limited by minl Tl. As opposed to the throughput,
the latency of the overall NN canceller is more complicated
to derive in general. However, in the special case where the
number of PEs for each layer l is chosen such that no stalling
happens and Nl + 1 is even, the latency can be calculated as:
$$L = \max\left( \sum_{l=1}^{(N_l+1)/2} \left( L_{2l-1,\mathrm{first}} + L_{2l} \right),\ L_{\mathrm{linear}} \right). \tag{34}$$
Finally, we note that the denormalization step shown in
Fig. 3 is constrained to scaling with powers of two, which
can be implemented efficiently with simple shifting operations,
both during training and during inference.
V. POLYNOMIAL CANCELLER HARDWARE ARCHITECTURE
Since, to the best of our knowledge, there are no published
implementations of polynomial SI cancellers in the literature,
we provide our own optimized reference implementation.
Our polynomial SI canceller architecture, which is shown in
Fig. 7, is largely based on the NN architecture since the main
computational tasks of the two cancellers are very similar (i.e.,
computation of weighted sums). The main differences are that
the input interface also computes the basis functions, that NCPE
complex PEs (CPEs) are used to perform computations on
complex values, and that there is only a single macro-pipeline
stage. In the remainder of this section, we explain how the
basis functions can be computed efficiently and we describe
the polynomial SI canceller in more detail.
A. Basis Function Computation
The computation of the NBF basis functions in (5) for each
cancellation sample seems like a cumbersome task. Fortu-
nately, we can show that the basis functions have a number
of properties that enable their efficient computation. First,
significant basis function re-use is possible. In particular, after
$\hat{y}_{\text{SI}}[n-1]$ has been computed based on
$\text{BF}_{p,q}(x[n-1-l])$, $l \in \{0,\dots,L-1\}$,
$p \in \{1,3,\dots,P\}$, $q \in \{0,\dots,p\}$, the basis functions
for $l \in \{0,\dots,L-2\}$ can be stored and re-used for the
computation of $\hat{y}_{\text{SI}}[n]$. As such, the only new basis
functions that need to be computed for $\hat{y}_{\text{SI}}[n]$ are
$\text{BF}_{p,q}(x[n])$, $p \in \{1,3,\dots,P\}$, $q \in \{0,\dots,p\}$.
This requires $\frac{L-1}{4}(P+1)(P+3)$ memory elements, but reduces
the number of basis functions that need to be computed by a
factor of $L$ from $\frac{L}{4}(P+1)(P+3)$ to $\frac{1}{4}(P+1)(P+3)$.
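The counts above can be reproduced with a short Python sketch. The function name `basis_function_counts` is hypothetical; the example values L = 13 and P = 5 are the ones used later in Section VI.

```python
def basis_function_counts(num_taps, max_order):
    """Return (new BFs per sample, BFs without re-use, stored BFs).

    num_taps is the memory length L and max_order is the odd maximum order P.
    """
    if max_order < 1 or max_order % 2 == 0:
        raise ValueError("P must be a positive odd integer")
    # One basis function per pair (p, q) with p odd and q in {0, ..., p}.
    per_sample = sum(p + 1 for p in range(1, max_order + 1, 2))
    assert per_sample == (max_order + 1) * (max_order + 3) // 4
    total = num_taps * per_sample          # L/4 (P+1)(P+3)
    stored = (num_taps - 1) * per_sample   # (L-1)/4 (P+1)(P+3)
    return per_sample, total, stored


print(basis_function_counts(13, 5))  # (12, 156, 144)
```

For L = 13 and P = 5, only 12 of the 156 basis functions have to be computed for each new sample, at the cost of 144 memory elements.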
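Moreover, the following proposition shows two additional
properties of the basis functions.
Proposition 1: For the basis functions in (5), it holds that:
1) $\text{BF}_{p,q}(x) = \left(\text{BF}_{p,p-q}(x)\right)^*$
2) $\text{BF}_{p,q}(x) = x^2\,\text{BF}_{p-2,q-2}(x)$
Proof: Both properties follow from the definition of the
basis function in (5). Specifically, for 1) we have:
$$\begin{aligned}
\text{BF}_{p,q}(x) &= x^{q}(x^*)^{p-q} \\
&= \left( x^{p-q}(x^*)^{p-(p-q)} \right)^* \\
&= \left(\text{BF}_{p,p-q}(x)\right)^*,
\end{aligned} \tag{35}$$
and for 2) we have:
$$\begin{aligned}
\text{BF}_{p,q}(x) &= x^{q}(x^*)^{p-q} \\
&= x^2 x^{q-2}(x^*)^{p-2-(q-2)} \\
&= x^2\,\text{BF}_{p-2,q-2}(x).
\end{aligned} \tag{36}$$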
Property 1) enables a computation reduction by a factor of two
since for every $p \in \{1,3,\dots,P\}$, it is sufficient to compute
$\text{BF}_{p,q}(x)$ only for $q \in \{\frac{p+1}{2},\dots,p\}$ and the
remaining basis functions for $q \in \{0,\dots,\frac{p-1}{2}\}$ can be
obtained by simple conjugation. Moreover, property 2) reveals an
efficient dynamic programming (DP) method to compute the basis
functions for $x[n]$, which is shown in Algorithm 1.

Algorithm 1 Dynamic programming computation of basis functions
 1: Input: $x[n]$
 2: Outputs: $\text{BF}_{p,q}(x[n])$ for $p \in \{1,3,\dots,P\}$, $q \in \{0,\dots,p\}$
 3: $\text{BF}_{1,0}(x[n]) \leftarrow (x[n])^*$
 4: $\text{BF}_{1,1}(x[n]) \leftarrow x[n]$
 5: for $p \in \{3,5,\dots,P\}$ do
 6:     for $q \in \{\frac{p+1}{2},\dots,p\}$ do
 7:         $\text{BF}_{p,q}(x[n]) \leftarrow (x[n])^2\,\text{BF}_{p-2,q-2}(x[n])$
 8:         $\text{BF}_{p,p-q}(x[n]) \leftarrow \left(\text{BF}_{p,q}(x[n])\right)^*$
 9:     end for
10: end for

Algorithm 1 requires one multiplication to pre-compute $(x[n])^2$
and $\frac{1}{8}(P+1)(P+3)-2$ multiplications for all executions of
line 7. The conjugation in line 8 does not require any
multiplications as it is a simple sign change of the imaginary
part of $\text{BF}_{p,p-q}(x)$. As such, the total number of
multiplications to compute the basis functions for a baseband
sample $x[n]$ is:
$$N_{\text{MUL,BF}} = \frac{1}{8}(P+1)(P+3) - 1. \tag{37}$$
One downside of the DP approach is that only the inner
loop in Algorithm 1 can be parallelized. However, in most
typical applications we have P ≤ 9, so that the outer loop in
Algorithm 1 is executed very few times. We note that, due to
the efficiency of Algorithm 1, NMUL,BF is significantly smaller
than NMUL,poly, which justifies ignoring the multiplications of
the basis function computations in (11) for simplicity.
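Algorithm 1 can be sketched in Python as follows. This is only an illustrative software model (floating-point arithmetic, hypothetical function names), not the hardware implementation; it also checks the DP result against the direct definition $\text{BF}_{p,q}(x) = x^q (x^*)^{p-q}$ used in the proof of Proposition 1.

```python
def basis_functions_dp(x, max_order):
    """Compute BF_{p,q}(x) for odd p <= max_order following Algorithm 1."""
    bf = {(1, 0): x.conjugate(), (1, 1): x}   # lines 3 and 4
    x_sq = x * x                              # pre-computed once
    for p in range(3, max_order + 1, 2):      # outer loop (sequential)
        for q in range((p + 1) // 2, p + 1):  # inner loop (parallelizable)
            bf[(p, q)] = x_sq * bf[(p - 2, q - 2)]   # line 7, property 2)
            bf[(p, p - q)] = bf[(p, q)].conjugate()  # line 8, property 1)
    return bf


def basis_functions_direct(x, max_order):
    """Reference computation directly from x^q (x^*)^(p-q)."""
    return {(p, q): x**q * x.conjugate()**(p - q)
            for p in range(1, max_order + 1, 2)
            for q in range(p + 1)}


x = complex(0.3, -0.7)  # arbitrary test sample
dp = basis_functions_dp(x, 7)
ref = basis_functions_direct(x, 7)
assert dp.keys() == ref.keys()
assert all(abs(dp[k] - ref[k]) < 1e-12 for k in ref)
```

The dictionary keyed by (p, q) mirrors the table that the DP recursion fills in: every entry of order p is obtained from an already available entry of order p − 2.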
B. Polynomial Canceller Architecture
We use a high-level structure that is similar to the NN-based
cancellers in Fig. 3 in the sense that linear cancellation is
done in parallel to non-linear cancellation and the polynomial
SI canceller focuses only on the non-linear part of the SI
signal. Since most of the SI signal is linear, removing the
linear term separately significantly reduces the dynamic range
of the values within the polynomial SI canceller, which in turn
allows us to reduce the common quantization bit-width Q for
the real and the imaginary parts of the involved quantities.
1) Input & Output Interfaces: The input interface consists
of $N_{\text{CPE}}$ multiplexers, which route each of the $N_{\text{BF}}$
BFs to the correct CPE in order to compute parts of the sum in (7).
As mentioned previously, the input interface also computes
the BFs using $N_{\text{CPE,BF}}$ CPEs. Since only the inner loop in
Algorithm 1 can be parallelized, it is reasonable to constrain
$N_{\text{CPE,BF}}$ so that $N_{\text{CPE,BF}} \leq \frac{P+1}{2}$. The
number of clock cycles required to compute all BFs with
$N_{\text{CPE,BF}}$ CPEs is:
$$L_{\text{BF}} = 1 + \sum_{\substack{p=3,\\ p\ \text{odd}}}^{P} \left\lceil \frac{p+1}{2N_{\text{CPE,BF}}} \right\rceil, \tag{38}$$
where one clock cycle is used to pre-compute $x^2$ and the
result and $x^*$ are stored in two $2Q$-bit registers. The
$\frac{L-1}{4}(P+1)(P+3)$ BFs that are re-used are stored in a
circular buffer.
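The cycle count in (38) is easy to evaluate numerically. The following Python sketch (with the hypothetical function name `bf_latency`) does so for P = 7, which is the example discussed in the latency analysis of this section.

```python
import math


def bf_latency(max_order, n_cpe_bf):
    """Clock cycles needed to compute the new basis functions, as in (38)."""
    cycles = 1  # one cycle to pre-compute x^2
    for p in range(3, max_order + 1, 2):
        # (p + 1) / 2 products per order p, shared among n_cpe_bf CPEs.
        cycles += math.ceil((p + 1) / (2 * n_cpe_bf))
    return cycles


print(bf_latency(7, 1), bf_latency(7, 2))  # 10 6
```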
Fig. 7. Block diagram of the polynomial canceller architecture. [Blocks: basis functions calculator / memory, complex PEs (single register), output block (tree adder), complex weights memory with external memory access, pipeline control, and control unit (counters, memory signals generation). Control signals: weights, PE enable, reset sum, memory signals, basis function selection.]
The output interface consists of an adder tree that adds up
the partial sums stored in the NCPE CPEs in order to produce
the final result.
2) Complex Processing Elements: The NCPE CPEs are
complex MAC units with a Q-bit register to store partial sums.
The complex MAC units are implemented using three real-
valued multipliers and five real-valued adders.
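One well-known way to realize a complex multiplication with three real-valued multiplications and five real-valued additions is the following factorization. Whether the CPEs use exactly this arrangement is an assumption on our side; the Python sketch below (hypothetical function names) only illustrates the arithmetic of the multiplication and adds the accumulation step separately.

```python
def complex_mul_3mult(a, b, c, d):
    """Return (re, im) of (a + jb)(c + jd) using three real multiplications."""
    k1 = c * (a + b)         # multiplication 1, addition 1
    k2 = a * (d - c)         # multiplication 2, addition 2
    k3 = b * (c + d)         # multiplication 3, addition 3
    return k1 - k3, k1 + k2  # additions 4 and 5


def complex_mac(acc, x, w):
    """Accumulate x * w onto the partial sum acc (all Python complex values)."""
    re, im = complex_mul_3mult(x.real, x.imag, w.real, w.imag)
    return complex(acc.real + re, acc.imag + im)


print(complex_mul_3mult(1, 2, 3, 4))  # (-5, 10), i.e., (1 + 2j)(3 + 4j)
```

Compared to the direct form with four multiplications and two additions, this trades one multiplier for three additional adders, which is typically attractive when multipliers dominate the area.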
3) Control Unit: Similarly to the NN-based canceller, the
main tasks of the control unit are to distribute the computations
to the CPEs and to stall the computations when no valid inputs
are available. The control unit schedules the operation so that
the CPEs first compute the terms of (7) that are based on
BFs that are already available in the circular buffer. In the
meantime, the input interface computes the $\frac{1}{4}(P+1)(P+3)$
BFs that depend on the new sample $x[n]$.
4) Parameter Memory: The parameter memory is used
to store the complex-valued $\hat{h}_{p,q}$ parameters of the
polynomial canceller. The memory contains
$\left\lceil \frac{N_{\text{BF}}}{N_{\text{CPE}}} \right\rceil$ words that
are $2QN_{\text{CPE}}$ bits wide so that all $N_{\text{CPE}}$ CPEs can be
provided with the parameters in parallel.
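To give a feeling for the resulting memory size, the following Python sketch evaluates the organization described above. The function name is hypothetical, and the example values (156 basis functions, 12 CPEs, Q = 20) are assumptions taken from the configuration used in Section VI-B.

```python
import math


def parameter_memory(n_bf, n_cpe, q_bits):
    """Return (words, bits per word, total bits) of the parameter memory."""
    words = math.ceil(n_bf / n_cpe)
    word_bits = 2 * q_bits * n_cpe  # real and imaginary part for every CPE
    return words, word_bits, words * word_bits


print(parameter_memory(156, 12, 20))  # (13, 480, 6240)
```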
5) Latency: In most practical cases, the latency of computing
the new BFs is significantly smaller than the latency of
computing the terms of (7), so that the latency of this operation
is masked entirely and can be ignored. For example, for $P = 7$
and $N_{\text{CPE,BF}} = 1$, according to (38) it takes only 10 clock
cycles to compute the new BFs. Setting $N_{\text{CPE,BF}} = 2$ reduces
the number of required cycles to 6. As such, we can safely
assume that the latency of the polynomial canceller is limited
by the computation of (7). The latency of the polynomial SI
canceller is given by:
$$L_{\text{poly}} = \left\lceil \frac{N_{\text{BF}}}{N_{\text{CPE}}} \right\rceil + 1, \tag{39}$$
where one clock cycle is required by the adder tree in the
output interface to produce the final output. Since a pipeline
register is inserted before the adder tree of the output interface,
the throughput of the polynomial SI canceller, measured in
samples per clock cycle, is given by
$T_{\text{poly}} = \frac{1}{L_{\text{poly}}}$.
VI. NUMERICAL AND HARDWARE IMPLEMENTATION
RESULTS
In this section, we compare the polynomial SI canceller with
the NN-based SI cancellers in terms of their SI cancellation
performance and their hardware implementation complexity.
Specifically, we first provide a performance comparison of
the polynomial SI canceller with the shallow NN SI canceller
of [22] as well as with an instance of a deep NN SI canceller
of [24]. We then present a comparison of FPGA and ASIC
implementation results for the polynomial SI canceller and
the NN-based SI cancellers in an equi-performance scenario.

Fig. 8. Comparison of the SI cancellation performance of polynomial and
NN-based cancellers. The pre-digital-cancellation SI signal and the receiver
thermal noise floor are also shown for comparison. [Plot: power spectral
density (dBm/Hz) versus frequency (MHz). Legend: SI signal (−42.8 dBm),
linear (−80.6 dBm), poly. P = 5 (−87.3 dBm), shallow NN (−87.2 dBm),
deep NN (−87.1 dBm), noise floor (−90.8 dBm).]
A. Self-Interference Cancellation Performance Comparison
1) Comparison Setup: The complexity expressions for the
polynomial SI canceller in (10)-(11) and the NN SI cancellers
in (23)-(24) cannot be compared directly because they contain
different sets of parameters. Thus, in order to perform a fair
complexity comparison, we select values for L, P , Nl, and
Nh so that the compared polynomial and NN cancellers have
as similar SI cancellation performance as possible.
For the performance evaluation, we use the dataset that was
used in [22], which consists of a 10 MHz QPSK-modulated
OFDM SI signal sampled at 20 MHz that is generated using an
actual full-duplex testbed with a transmit power of 10 dBm
(the dataset is available at https://github.com/abalatsoukas/fdnn).
The dataset contains 20,000 time-domain SI baseband samples,
out of which 90% is used for training and 10% for the
evaluation of the SI performance. For NN training, we use a
mini-batch size of B = 32 and the Adam optimizer [32] with
a learning rate of λ = 0.004.
2) Results: In Fig. 8, we show the power spectral density
(PSD) of the received SI signal ySI[n] before any SI cancellation
is performed, the PSD of the received signal when no
transmission takes place (i.e., the effective noise floor of the
receiver), as well as the PSDs of the SI signals after linear
SI cancellation and after non-linear SI cancellation with the
polynomial and NN-based cancellers. The results are obtained
for L = 13 for all SI cancellers and P = 5 for the polynomial
SI canceller. Moreover, the shallow NN has Nl = 1 hidden
layer with Nh = 18 neurons, while the deep NN has Nl = 5
hidden layers with Nh = 6 neurons per layer. These parameter
values are chosen as follows. First, we select L by slowly
increasing its value until no further linear cancellation gains
are obtained. We then use the polynomial SI canceller and
we increase P until the gain in SI cancellation performance
becomes very small. When going from P = 5 to P = 7, the SI
cancellation only improves by 0.3 dB while the computational
complexity almost doubles, so that P = 5 provides a sensible
complexity-performance trade-off. We then use the same value
of L for the NN-based SI cancellers and we select Nl and Nh
to match the performance of the polynomial SI canceller.

Fig. 9. Training convergence of shallow (Nl = 1) and deep (Nl = 5) NN-based
SI cancellers with Nh = 18 and Nh = 6 neurons per layer, respectively.
[Plot: non-linear SI cancellation (dB) versus training epoch, with training
and test curves for Nl = 1 and Nl = 5.]
We observe that, with these parameter settings, all SI
cancellers indeed achieve very similar performance and can
cancel the SI very close to the receiver noise floor, with the
polynomial canceller being 0.1 dB and 0.2 dB better than the
shallow and the deep NN-based cancellers, respectively (cf. the
residual SI powers reported in Fig. 8). We note that, in light of
our recent results in [24, Fig. 5], using P = 5 for the polynomial
SI canceller results in a fairer comparison than using P = 7 as
was done in [22]. However, as we show in the sequel, even in
this case there are very clear advantages in terms of the
implementation complexity when using an NN-based canceller.
In Fig. 9, we show the training convergence behavior of
the shallow and deep NN SI cancellers. We observe that
the shallow and deep NNs require only 7 and 12 training
epochs to achieve more than 5 dB of non-linear cancellation,
respectively. Training for up to 20 epochs further increases the
cancellation performance to approximately 5.5 dB, but there
are clear diminishing returns. Moreover, we observe that the
cancellers have similar performance on the training and test
samples, meaning that there are no overfitting issues. We note,
however, that during our experiments we observed that the
deep NN is more sensitive to the weight initialization.
TABLE I
COMPARISON OF THE SI CANCELLATION PERFORMANCE AND THE
COMPLEXITY OF POLYNOMIAL AND NN-BASED CANCELLERS.

                     Poly. P = 5    Shallow NN    Deep NN
Cancellation (dB)    44.4           44.4          44.3
L                    13             13            13
P                    5              n/a           n/a
Nl                   n/a            1             5
Nh                   n/a            18            6
Real Add.            1090           613           433
Real Mult.           468            543           351

In Table I, we show the complexity of the three SI cancellers
in terms of the number of real-valued multiplications and
additions given by (10)-(11) and (23)-(24). We observe that
the polynomial SI canceller requires the largest number of
additions, while the shallow NN SI canceller requires the
largest number of multiplications. The deep NN SI canceller,
on the other hand, achieves a significant complexity reduction
as it requires 25% fewer multiplications and 60% fewer
additions than the polynomial SI canceller.
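The relative savings quoted above follow directly from the operation counts in Table I, as the following short Python sketch shows (the dictionary layout and function name are only illustrative).

```python
# Real-valued operation counts copied from Table I.
table_i = {
    "poly":    {"add": 1090, "mult": 468},
    "shallow": {"add": 613,  "mult": 543},
    "deep":    {"add": 433,  "mult": 351},
}


def reduction_percent(reference, value):
    """Relative reduction of value with respect to reference, in percent."""
    return 100.0 * (reference - value) / reference


for op in ("mult", "add"):
    r = reduction_percent(table_i["poly"][op], table_i["deep"][op])
    print(f"deep NN vs. poly., {op}: {r:.1f}% fewer")
# deep NN vs. poly., mult: 25.0% fewer
# deep NN vs. poly., add: 60.3% fewer
```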
B. FPGA and ASIC Implementation Results
1) Comparison Setup: In Section VI-A1, we already
showed that using L = 13 for all cancellers, P = 5 for the
polynomial canceller, a single hidden layer with Nh = 18
neurons for the shallow NN canceller, and Nl = 5 hidden
layers with Nh = 6 for the deep NN canceller, leads to
practically identical SI cancellation performance. However, in
order to perform a meaningful comparison of FPGA and ASIC
implementation results, the quantization bit-width Q also needs
to be selected individually for each canceller so as to
minimize the implementation complexity while keeping the
performance of the SI cancellers similar. In Fig. 10, we show
the cancellation performance for the polynomial SI canceller
and the NN SI cancellers as a function of the quantization bit-
width Q. We observe that both NN SI cancellers generally
require a lower quantization bit-width Q compared to the
polynomial SI canceller to achieve the same SI cancellation
performance. Moreover, for the hardware implementation re-
sults presented in this section, we can choose Q = 17 for the
shallow NN SI canceller, Q = 19 for the deep NN SI canceller,
and Q = 20 for the polynomial SI canceller, as this choice
leads to effectively identical SI cancellation performance and a
very small loss with respect to the corresponding floating-point
implementations for all cancellers. The deep NN SI canceller
requires two additional integer bits compared to the shallow
NN SI canceller, due to larger absolute output values in the
hidden layers. This effectively shifts the bit-width versus SI
cancellation performance curve of the deep NN SI canceller by
two bits to the right compared to the shallow NN SI canceller.
Fig. 10. Total SI cancellation for the polynomial and NN SI cancellers as
a function of the datapath bit-width Q. The circled points for each canceller
are used for the FPGA and ASIC implementation results in Section VI-B.
[Plot: SI cancellation (dB) versus bit-width Q (bits). Curves: NN
(floating-point), NN (fixed-point), deep NN (floating-point), deep NN
(fixed-point), poly. P = 5 (floating-point), poly. P = 5 (fixed-point).]

TABLE II
FPGA IMPLEMENTATION RESULTS (VIRTEX-7 XC7VX485). THE
POLYNOMIAL CANCELLER USES L = 13 AND P = 5, THE SHALLOW NN
CANCELLER USES Nl = 1 AND Nh = 18, THE DEEP NN CANCELLER USES
Nl = 5 AND Nh = 6.

                               Poly. (P = 5)    Shallow NN      Deep NN
Slices                         2099 (2.77%)     1700 (2.24%)    2212 (2.91%)
LUT (logic)                    5693 (1.88%)     2833 (0.93%)    3761 (1.24%)
LUT (RAM)                      884 (0.29%)      1678 (0.55%)    1767 (0.58%)
Registers                      2462 (0.41%)     2625 (0.43%)    3887 (0.64%)
DSP Slices                     42 (1.50%)       62 (2.21%)      61 (2.07%)
Frequency (MHz)                85.5             86.6            69.7
Throughput (Msamples/s)        6.6              9.6             11.6
Throughput (cycles/sample)     13               9               6
Latency (ns)                   152              115             304
Latency (cycles)               13               10              21

For the shallow NN SI canceller, we set NPE,1 = 52 and
NPE,2 = 4 so that T1 = T2 = 1/9. With this setting, the
macro-pipeline is perfectly balanced and one cancellation sample
is produced every 9 clock cycles. Furthermore, NCPE,linear = 2
CPEs are instantiated for the NN SI canceller to ensure that the
linear cancellation step can be completed in the same number
of cycles. For the deep NN SI canceller, we set NPE,1 = 26
for the first hidden layer, NPE,l = 6 for the remaining hidden
layers, and NPE,6 = 2 for the output layer so that Tl = 1/6
for all l ∈ {1, . . . , 6} and the macro-pipeline is balanced.
The deep NN SI canceller uses NCPE,linear = 3 CPEs for the
linear canceller due to the increased throughput requirements.
The shallow NN SI canceller thus requires a total of 56 PEs
and the deep NN SI canceller requires a total of 52 PEs, but
the deep NN SI canceller requires one more CPE than the
shallow NN canceller. Finally, for the polynomial canceller,
we use NCPE = 12 complex PEs so that the 156 complex
multiplications required to compute (7) for L = 13 and P = 5
can be carried out in Lpoly = 13 clock cycles. We also use
NCPE,BF = 1 CPE for the BF computation, meaning that
LBF = 10. Since Lpoly > LBF, one cancellation sample is
produced every Lpoly = 13 clock cycles.
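The PE allocation of the two NN cancellers can be sanity-checked with a few lines of Python. The sketch below assumes that every stage, NBN or IBI, has a throughput of NPE,l divided by the number of multiply-accumulate operations of layer l, as in (30), and that the output layer has two neurons (real and imaginary parts); both are assumptions on our side, and all names are illustrative.

```python
from fractions import Fraction


def stage_throughputs(layer_sizes, pes):
    """Per-stage throughput N_PE,l / (N_El * N_El-1) in samples per cycle."""
    return [Fraction(n_pe, n_out * n_in)
            for n_in, n_out, n_pe in zip(layer_sizes, layer_sizes[1:], pes)]


L = 13                                   # memory length, 2L real-valued inputs
shallow_sizes = [2 * L, 18, 2]           # assumed: two output neurons
deep_sizes = [2 * L, 6, 6, 6, 6, 6, 2]   # assumed: two output neurons
shallow_pes = [52, 4]
deep_pes = [26, 6, 6, 6, 6, 2]

print(sum(shallow_pes), sum(deep_pes))   # 56 52
assert set(stage_throughputs(shallow_sizes, shallow_pes)) == {Fraction(1, 9)}
assert set(stage_throughputs(deep_sizes, deep_pes)) == {Fraction(1, 6)}
```

Both assertions pass, i.e., under these assumptions every stage of each macro-pipeline has the same throughput and no stage has to stall.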
2) FPGA Implementation Results: In Table II, we show
place-and-route (PAR) results on a Xilinx Virtex-7 XC7VX485
(speed grade -2) FPGA, which contains a total of 75.9k slices,
303.6k LUTs, 607.2k flip-flops, and 2.8k DSP slices. A clock
frequency target of 100 MHz is used for all cancellers.
We observe that the shallow NN SI canceller has a lower
slice and LUT-as-logic utilization than the polynomial SI
canceller and a 45% higher throughput. The higher throughput
of the shallow NN SI canceller comes both from a lower
number of cycles per sample and from a slightly higher
operating frequency compared to the polynomial SI canceller.
We also note that the polynomial SI canceller requires
approximately 32% fewer DSP slices than both the shallow and the
deep NN SI canceller. The deep NN SI canceller uses more
resources than the shallow NN SI canceller, but it has a 77%
higher throughput than the polynomial SI canceller. The main
additional cost for the deep NN SI canceller compared to the
shallow NN SI canceller comes from registers and LUTs used
as logic. Even though the deep NN SI canceller is not able
to achieve a clock frequency as high as the other cancellers,
it still has the highest throughput. We also observe that the
shallow NN SI canceller has the lowest latency, followed by
the polynomial SI canceller and the deep NN SI canceller.

TABLE III
ASIC IMPLEMENTATION RESULTS (28 NM FD-SOI). THE POLYNOMIAL CANCELLER USES L = 13 AND P = 5, THE SHALLOW NN CANCELLER USES
Nl = 1 AND Nh = 18, THE DEEP NN CANCELLER USES Nl = 5 AND Nh = 6. WE USE SLOW-SLOW CORNERS, AN OPERATING VOLTAGE OF 0.7 V, AND
AN OPERATING TEMPERATURE OF 125 °C.

                                        Target Maximum Throughput                 20 Msamples/s Throughput
                                        Poly. (P = 5)  Shallow NN  Deep NN        Poly. (P = 5)  Shallow NN  Deep NN
Area (mm2)                              0.179          0.255       0.187          0.155          0.238       0.162
Area (GE)                               366,131        521,654     380,967        317,165        485,142     330,125
Frequency (MHz)                         354            326         351            270            180         120
Throughput (Msamples/s)                 27.2           36.2        58.5           20.0           20.0        20.0
Throughput (cycles/sample)              13             9           6              13             9           6
Latency (ns)                            37             31          60             48             56          175
Latency (cycles)                        13             10          21             13             10          21
Total Power (mW)                        49.8           61.9        54.5           31.3           38.0        22.1
Internal Power (mW)                     26.0           35.7        30.0           16.9           21.1        11.7
Switching Power (mW)                    22.4           24.4        23.2           13.3           15.4        9.4
Leakage Power (mW)                      1.4            1.8         1.3            1.1            1.5         1.0
Hardware Efficiency (Msamples/s/mm2)    152.0          142.0       312.8          129.0          84.0        123.5
Energy Efficiency (nJ/sample)           1.8            1.7         0.9            1.6            1.9         1.1
3) ASIC Implementation Results: In Table III, we present
ASIC implementation results for the polynomial SI canceller
and the two NN SI cancellers using a 28 nm FD-SOI technol-
ogy. We target two different points in terms of the operating
frequency, namely, a maximum throughput point and a point
where each SI canceller achieves a throughput of exactly
20 MS/s, which is sufficient for the dataset that we consider.
In both cases, we use slow-slow corners, an operating voltage
of 0.7 V, and an operating temperature of 125 °C. For the
power results, post-PAR simulations are used both to verify
the design and to accurately estimate the switching activity.
We observe that, for the maximum throughput operating
point, the NN SI cancellers are both significantly faster and
more energy efficient than the polynomial SI canceller. How-
ever, the polynomial SI canceller is generally smaller and has
a lower power consumption. More specifically, the shallow
NN SI canceller is 33% faster but also 42% larger than the
polynomial SI canceller. As a result, its area efficiency is 7%
lower than that of the polynomial SI canceller. The deep NN
SI canceller, on the other hand, is only 4% larger than the
polynomial SI canceller and at the same time 115% faster and
has a 106% better area efficiency. The energy efficiency of the
NN SI cancellers is also significantly better compared to the
polynomial SI canceller. Specifically, the shallow and deep NN
SI cancellers improve the energy efficiency by 6% and 50%
compared to the polynomial SI canceller, respectively. Finally,
we observe that the deep NN SI canceller has the worst latency
(60 ns), followed by the polynomial SI canceller (37 ns) and
the shallow NN canceller (31 ns).
As mentioned previously, for the dataset that we use in
this work, a throughput of 20 MS/s is sufficient for real-time
operation. We observe that the relaxed timing requirements,
in this case, reduce the area of the polynomial SI canceller
by 13%, the shallow NN SI canceller by 7% and the deep
NN SI canceller by 13% compared to the results for the
maximum throughput operating point. Interestingly, the en-
ergy efficiency per sample only improves for the polynomial
canceller, whereas it becomes slightly worse for the two NN
SI cancellers. Nevertheless, the deep NN SI canceller is still
30% more energy efficient than the polynomial SI canceller
and only 4% less area efficient. Moreover, at this operating
point, the polynomial canceller has the lowest latency (48 ns)
and the deep NN canceller has the lowest power consumption.
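The derived metrics in Table III (throughput, hardware efficiency, and energy efficiency) can be recomputed from the reported clock frequency, cycles per sample, area, and total power. The following Python sketch does this for the maximum-throughput operating point; the function and variable names are illustrative, and small deviations from the table are expected because the table entries are rounded.

```python
# Maximum-throughput operating point, values copied from Table III:
# name: (clock in MHz, cycles per sample, area in mm^2, total power in mW)
asic_max = {
    "poly":    (354, 13, 0.179, 49.8),
    "shallow": (326, 9, 0.255, 61.9),
    "deep":    (351, 6, 0.187, 54.5),
}


def derived_metrics(f_mhz, cycles_per_sample, area_mm2, power_mw):
    """Return throughput, hardware efficiency, and energy per sample."""
    throughput = f_mhz / cycles_per_sample  # Msamples/s
    hw_eff = throughput / area_mm2          # Msamples/s/mm^2
    energy = power_mw / throughput          # mW / (Msamples/s) = nJ/sample
    return throughput, hw_eff, energy


for name, params in asic_max.items():
    t, h, e = derived_metrics(*params)
    print(f"{name:8s} {t:6.1f} Msamples/s {h:6.1f} Msamples/s/mm^2 {e:4.2f} nJ/sample")
```

The ratios between the deep NN and polynomial rows reproduce the improvements of roughly 2.1× in hardware efficiency and 2× in energy efficiency.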
VII. CONCLUSION
In this paper, we presented a high-throughput hardware
architecture for an NN-based SI cancellation scheme for
full-duplex radios. We also presented, to the best of our knowledge,
the first efficient hardware architecture for polynomial SI
cancellation in the literature, which we used as a comparison
baseline for the NN-based SI cancellers. Our implementation
results show that the NN SI cancellers have significantly lower
computational complexity than a conventional polynomial SI
canceller, which translates into substantial area and energy
savings when the schemes are implemented in hardware.
Specifically, an ASIC implementation of a deep NN-based SI
canceller has up to 2.1× better hardware efficiency and up to
2× better energy efficiency than a conventional polynomial
SI canceller.
REFERENCES
[1] Y. Kurzo, A. Burg, and A. Balatsoukas-Stimming, “Design and im-
plementation of a neural network aided self-interference cancellation
scheme for full-duplex radios,” in Asilomar Conf. on Signals, Systems
and Computers, Oct. 2018, pp. 589–593.
[2] M. Jain, J. I. Choi, T. Kim, D. Bharadia, S. Seth, K. Srinivasan, P. Levis,
S. Katti, and P. Sinha, “Practical, real-time, full duplex wireless,” in Int.
Conf. on Mobile Computing and Networking. ACM, 2011, pp. 301–312.
[3] M. Duarte, C. Dick, and A. Sabharwal, “Experiment-driven
characterization of full-duplex wireless systems,” IEEE Trans. Wireless
Commun., vol. 11, no. 12, pp. 4296–4307, Dec. 2012.
[4] D. Bharadia, E. McMilin, and S. Katti, “Full duplex radios,” in ACM
SIGCOMM, 2013, pp. 375–386.
[5] A. Balatsoukas-Stimming, A. C. M. Austin, P. Belanovic, and A. Burg,
“Baseband and RF hardware impairments in full-duplex wireless sys-
tems: experimental characterisation and suppression,” EURASIP J. on
Wireless Comm. and Netw., vol. 2015, no. 142, 2015.
[6] D. Korpi, L. Anttila, V. Syrjala, and M. Valkama, “Widely linear
digital self-interference cancellation in direct-conversion full-duplex
transceiver,” IEEE J. Sel. Areas Commun., vol. 32, no. 9, pp. 1674–
1687, Sep. 2014.
[7] A. Sahai, G. Patel, C. Dick, and A. Sabharwal, “On the impact of phase
noise on active cancelation in wireless full-duplex,” IEEE Trans. Veh.
Technol., vol. 62, no. 9, pp. 4494–4510, Nov. 2013.
[8] V. Syrjala, M. Valkama, L. Anttila, T. Riihonen, and D. Korpi, “Analysis
of oscillator phase-noise effects on self-interference cancellation in full-
duplex OFDM radio transceivers,” IEEE Trans. Wireless Commun.,
vol. 13, no. 6, pp. 2977–2990, June 2014.
[9] L. Anttila, D. Korpi, E. Antonio-Rodríguez, R. Wichman, and
M. Valkama, “Modeling and efficient cancellation of nonlinear self-
interference in MIMO full-duplex transceivers,” in IEEE Globecom
Workshops, 2014, pp. 777–783.
[10] D. Korpi, L. Anttila, and M. Valkama, “Nonlinear self-interference can-
cellation in MIMO full-duplex transceivers under crosstalk,” EURASIP
J. on Wireless Comm. and Netw., vol. 2017, no. 1, p. 24, Feb. 2017.
[11] P. P. Campo, D. Korpi, L. Anttila, and M. Valkama, “Nonlinear digital
cancellation in full-duplex devices using spline-based Hammerstein
model,” in IEEE Globecom Workshops, Dec. 2018.
[12] T. O’Shea and J. Hoydis, “An introduction to deep learning for the
physical layer,” IEEE Trans. Cogn. Commun. and Networking, vol. 3,
no. 4, pp. 563–575, Dec. 2017.
[13] T. Wang, C. Wen, H. Wang, F. Gao, T. Jiang, and S. Jin, “Deep
learning for wireless physical layer: Opportunities and challenges,”
China Communications, vol. 14, no. 11, pp. 92–111, Nov. 2017.
[14] Q. Mao, F. Hu, and Q. Hao, “Deep learning for intelligent wireless
networks: A comprehensive survey,” IEEE Comm. Surveys Tutorials,
vol. 20, no. 4, pp. 2595–2621, Fourth Quarter 2018.
[15] D. Gunduz, P. de Kerret, N. D. Sidiropoulos, D. Gesbert, C. Murthy,
and M. van der Schaar, “Machine learning in the air,” Apr. 2019.
[Online]. Available: https://arxiv.org/abs/1904.12385
[16] Z. Qin, H. Ye, G. Y. Li, and B.-H. F. Juang, “Deep learning in physical
layer communications,” IEEE Wireless Commun., vol. 26, no. 2, Apr.
2019.
[17] A. Balatsoukas-Stimming and C. Studer, “Deep unfolding for communi-
cations systems: A survey and some new directions,” in IEEE Workshop
on Sig. Proc. Systems (SiPS), Oct. 2019.
[18] C. Tarver, L. Jiang, A. Sefidi, and J. Cavallaro, “Neural network DPD
via backpropagation through a neural network model of the PA,” in
Asilomar Conf. on Signals, Systems and Computers, Nov. 2019.
[19] R. Hongyo, Y. Egashira, T. M. Hone, and K. Yamaguchi, “Deep neural
network-based digital predistorter for Doherty power amplifiers,” IEEE
Microwave and Wireless Comp. Letters, vol. 29, no. 2, pp. 146–148,
Feb. 2019.
[20] O. Ploder, O. Lang, T. Paireder, and M. Huemer, “An adaptive ma-
chine learning based approach for the cancellation of second-order-
intermodulation distortions in 4G/5G transceivers,” in IEEE Vehicular
Technology Conf. (VTC2019-Fall), Sep. 2019.
[21] C. Häger and H. D. Pfister, “Nonlinear interference mitigation via
deep neural networks,” in Optical Fiber Commun. Conf. and Exposition
(OFC), Mar. 2018, pp. 1–3.
[22] A. Balatsoukas-Stimming, “Non-linear digital self-interference cancella-
tion for in-band full-duplex radios using neural networks,” in IEEE Int.
Workshop on Signal Proc. Advances in Wireless Commun. (SPAWC),
Jun. 2018, pp. 1–5.
[23] H. Guo, J. Xu, S. Zhu, and S. Wu, “Realtime software defined self-
interference cancellation based on machine learning for in-band full du-
plex wireless communications,” in Int. Conf. on Computing, Networking
and Commun. (ICNC), Mar. 2018, pp. 779–783.
[24] A. T. Kristensen, A. Burg, and A. Balatsoukas-Stimming, “Advanced
machine learning techniques for self-interference cancellation in full-
duplex radios,” in Asilomar Conf. on Signals, Systems and Computers,
Nov. 2019.
[25] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing
FPGA-based accelerator design for deep convolutional neural networks,”
in ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays, Feb.
2015, pp. 161–170.
[26] Y. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient
reconfigurable accelerator for deep convolutional neural networks,”
IEEE J. of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017.
[27] F. A. Aoudia and J. Hoydis, “Towards hardware implementation of neu-
ral network-based communication algorithms,” in IEEE Int. Workshop
on Signal Proc. Advances in Wireless Commun. (SPAWC), Jul. 2019.
[28] I. Wodiany and A. Pop, “Low-precision neural network decoding of
polar codes,” in IEEE Int. Workshop on Signal Proc. Advances in
Wireless Commun. (SPAWC), Jul. 2019.
[29] C. Tarver, A. Balatsoukas-Stimming, and J. Cavallaro, “Design and
implementation of a neural network based predistorter for enhanced
mobile broadband,” in IEEE Int. Workshop on Signal Processing Systems
(SiPS), Oct. 2019.
[30] K. Hornik, “Approximation capabilities of multilayer feedforward net-
works,” Neural Networks, vol. 4, no. 2, pp. 251–257, 1991.
[31] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning represen-
tations by back-propagating errors,” Nature, vol. 323, pp. 533–536, Oct.
1986.
[32] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
in Int. Conf. for Learning Representations (ICLR), May 2015.
