Hardware and latency optimisation for 5G digital pre-distortion by Byrne, Declan et al.
Hardware And Latency Optimisation for 5G Digital
Pre-Distortion




Abstract—In modern radio frequency (RF) transceivers the
power amplifier (PA) is a central component in terms of power
consumption. Achieving efficient performance in this component
results in the PA output signal becoming distorted. Linearisation
can be performed using techniques such as Digital Pre-Distortion
(DPD). The pre-distorter operation of a DPD system involves
the constant computation of a distorted signal to ensure linear
operation of the nonlinear power amplifier. In this work a
novel polynomial evaluation scheme is proposed to optimise
the pre-distorter operation within a DPD system. Improvements
to latency and hardware requirements are possible with new
techniques. Validation of the proposed design was conducted
using FPGA implementations and compared to incumbent
pipelined solutions for both low latency and hardware efficiency.
The proposed method indicated hardware savings of 67.8%, while
operating 58.7% faster, compared to an existing implementation.
Index Terms—digital pre-distortion, polynomial evaluation,
5G, FPGA, DSP.
I. INTRODUCTION
Modern cellular networks employ PAs to amplify
communication signals before transmission. In order to save
power and reduce heat dissipation, PAs can be driven in
a region of efficient operation. However, this introduces
nonlinear distortion to the PA output signal. Using linearisation
techniques, distortion can be mitigated, allowing the PA to
operate efficiently and linearly.
Digital Pre-Distortion (DPD) is the predominant
linearisation technique used in cellular communications
[1]. This method weights the PA input signal inversely to the
gain of the PA. This results in the PA output signal becoming
linearised.
DPD is commonly performed by employing a learning
architecture, such as Fig. 1, to train a nonlinear basis function
which is then applied to the PA input. Polynomial models,
derived from the Volterra series, are typically chosen and
referred to as DPD structures [2]. PA input and output signals
are sampled and used alongside a DPD structure to extract
coefficients of the polynomial function. This is performed in
the DPD system element known as the post-distorter. These
coefficients are then applied in the pre-distorter continuously
to pre-distort the incoming PA input signal.
This publication has emanated from research conducted with the financial
support of Science Foundation Ireland (SFI) and is co-funded under the
European Regional Development Fund under Grant Number 13/RC/2077.
To date, look up table (LUT) or pipelined implementations
have been used [3]. Pre-distorters implemented using LUTs
are suitable for low precision linearisation. Pipelined solutions
meet the need for high precision linearisation, but can require






Fig. 1: Indirect learning architecture
5G cellular communications specifications have highlighted
power efficient and low latency solutions as targeted goals for
next generation wireless networks [5]. In a DPD system, the
pre-distorter block is responsible for the majority of power
consumption and latency. It is active continuously after the
coefficients are trained. In contrast, the post-distorter block is
only active periodically to train a set of coefficients.
Employing DPD structures with high orders of nonlinearity
is necessary to linearise high efficiency PAs [6]. However,
unoptimised DPD solutions tackling high order distortion do
not scale well in terms of power consumption [1].
As pre-distorters apply polynomial based functions,
optimisation can be achieved using polynomial evaluation
methods. Polynomial evaluation schemes are algorithms which
allow polynomials to be expressed in a manner such that
hardware requirements and/or latency are reduced [4], [7].
Horner’s method, the optimal scheme in terms of hardware
efficiency, and thereby power efficiency, is well established
and well researched [8], [9]. However, this scheme is sequential
and slow, which is counterproductive to 5G goals.
Current polynomial evaluation techniques were originally
developed with a focus on optimising basic univariate
polynomials. Deviating from this format (including memory
taps and/or multivariate coefficients, pruning coefficients)
can restrict the performance of these schemes and is
common among polynomial functions applied within DPD
pre-distorters.
978-1-7281-2800-9/19/$31.00 ©2019 IEEE
Authorized licensed use limited to: Maynooth University Library. Downloaded on March 16,2021 at 17:37:28 UTC from IEEE Xplore.  Restrictions apply. 
The contribution of this paper is a polynomial evaluation
scheme that takes advantage of the aforementioned factors
concerning DPD functions. This proposed method presents
the most favourable solution to targets set forward by 5G
specifications, namely combined improved hardware efficiency
and latency performance. This method also scales favourably
with increasing nonlinearity order, necessary to linearise
certain high efficiency PAs.
The remainder of the paper is structured as follows: In
Section II a selection of polynomial structures is introduced.
These are the Memory Polynomial (MP) and Generalised
Memory Polynomial (GMP), commonly used DPD structures
due to their compact size and high accuracy. Three existing
evaluation schemes and a fourth novel method are described in
Section III. Results from FPGA implementations are presented
in Section IV for incumbent polynomial techniques and
compared to the novel method proposed in this paper. Section
V concludes the paper.
II. DIGITAL PRE-DISTORTION
Polynomial based DPD is performed by applying a set of
trained coefficients to a polynomial function. The output of
this function is subsequently applied as the input to the PA.













Where y is the PA output signal, x is the PA input signal
and $h_m(i_1, \cdots, i_m)$ is the mth order Volterra kernel. The
number of coefficients used in a function can be calculated as a
combination of the nonlinear order, M, and the number of
memory taps, K. The nonlinear order of a model mitigates
intermodulation distortion. Memory taps account for memory
effects which occur within the PA during operation [10]. The
number of memory taps used in a DPD structure is often
referred to as its memory depth.
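The dependence of the coefficient count on M and K can be illustrated with a short Python sketch. It assumes that an mth order kernel with K memory taps has one entry per ordered tuple of delays (K^m entries), or one entry per unordered selection when the kernel is taken to be symmetric. The function name and the `symmetric` flag are hypothetical and are not taken from the paper.

```python
from math import comb


def volterra_coefficient_count(order: int, taps: int, symmetric: bool = False) -> int:
    """Count kernel entries for nonlinear orders 1..order with `taps` memory taps."""
    total = 0
    for m in range(1, order + 1):
        if symmetric:
            # Symmetric kernel: unordered selections of m delays out of `taps`.
            total += comb(taps + m - 1, m)
        else:
            # Full kernel: every ordered tuple (i1, ..., im) has its own entry.
            total += taps ** m
    return total


if __name__ == "__main__":
    for order in (3, 5, 7):
        print(order, volterra_coefficient_count(order, taps=3))
```

Under these assumptions the count grows geometrically with the nonlinear order, which is the motivation for more compact structures.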
A more compact polynomial structure which can still
account for memory effects is the MP [11]. This DPD
structure achieves good accuracy using a moderate number







$h_{mk} \cdot x(n-m) \cdot |x(n-m)|^{k-1}$. (2)
The memory depth of the DPD coefficient, h, is denoted
by the value of its subscript. The number of variables in
the coefficient subscript indicates its nonlinear order. These
parameters describe the signal preparation necessary to apply
the coefficient. For instance, h11 requires a second order signal
with a memory depth of two to implement.
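A minimal Python sketch of the MP term in (2) is given here for illustration. The summation limits are an assumption, since only the summand of (2) is available: the delay m runs over the rows of a coefficient table `h` and the order k over its columns. Samples before the start of the input are assumed to be zero. The function and argument names are hypothetical.

```python
def memory_polynomial(x, h):
    """Evaluate y(n) = sum_m sum_k h[m][k-1] * x(n-m) * |x(n-m)|**(k-1).

    x: sequence of complex input samples.
    h: h[m][k-1] is the coefficient for delay m and nonlinear order k.
    """
    y = []
    for n in range(len(x)):
        acc = 0j
        for m, row in enumerate(h):
            if n - m < 0:
                continue  # assumed zero history before the first sample
            delayed = x[n - m]
            for k, coeff in enumerate(row, start=1):
                acc += coeff * delayed * abs(delayed) ** (k - 1)
        y.append(acc)
    return y


if __name__ == "__main__":
    # Two delays (m = 0, 1) and two orders (k = 1, 2); values are arbitrary.
    print(memory_polynomial([1 + 0j, 2 + 0j], [[1.0, 0.5], [0.25, 0.0]]))
```

Each coefficient only ever meets a single delayed sample and its magnitude, which is why the number of coefficients stays moderate.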
An MP model describes a PA using a univariate polynomial.
A memoryless univariate polynomial with a nonlinear order of









$\cdots + h_{0}x_{n}$. (3)
Polynomial (3), and all other polynomials in this paper, will

















$\cdots + h_{0}x_{n}$. (4)
It has been shown that odd order coefficients are typically
more prominent, in terms of the nonlinear response that must
be corrected, compared to even order terms. Disregarding even
order coefficients provides a latency and hardware resource
reduction with a slight loss in performance [12]. Dropping the









$\cdots + h_{0}x_{n}$. (5)
Memory taps can be added to a DPD structure to account
for the memory effects which occur in PAs. This change is
reflected in the pre-distorter with the introduction of additional
coefficients. Adding two memory taps to the polynomial
shown in (5) forms (6).
























To further increase performance cross term coefficients can
also be included in the DPD structure. Structures that include























$h_{mkq} \cdot x(n-m) \cdot |x(n-m+q)|^{k-1}$. (7)
Including cross term coefficients in a DPD structure can
considerably increase the number of DPD coefficients. As
an example, expanding the polynomial shown in (6)
to include multivariate cross terms yields the Volterra series
structure shown in (8).














nxn−1 + · · ·+ h52x5n.
(8)
It can be seen in polynomial (8) that many coefficients
require specific signals or signal combinations in order to be
applied. Delayed signals, such as $x_{n-1}$ and $x_{n-2}$, are obtained
using complex shift registers while signal combinations are
acquired using complex multipliers. Signal preparation can
be conducted using either hardware efficient or low latency
orientations.
Using a hardware efficient orientation, second order signals
required by a MP system with a memory depth of two can be
prepared as seen in Fig. 2. This orientation uses complex shift
registers to minimise the use of complex multipliers.








Fig. 2: 2nd order hardware efficient MP signal preparation
The signal preparation illustrated in Fig. 2 can also be
accomplished using a low latency orientation as seen in Fig.
3. This orientation uses complex multipliers to prepare signals
in parallel.






Fig. 3: 2nd order low latency MP signal preparation
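The two orientations can be contrasted with a small behavioural sketch in Python. It rests on several assumptions: the second order signal is taken to be x·|x|, following the form of (2); the shift register is modelled as a one-sample delay initialised to zero; and the multiplier counts in the comments reflect this model, not the figures. The function names are hypothetical.

```python
def prepare_hardware_efficient(x):
    """Multiply once, then delay the product with a shift register."""
    previous_product = 0j  # shift register holding the last product
    prepared = []
    for sample in x:
        product = sample * abs(sample)  # the only multiplication per sample
        prepared.append((product, previous_product))
        previous_product = product
    return prepared


def prepare_low_latency(x):
    """Delay the input, then form both products side by side."""
    previous_sample = 0j  # shift register holding the last input sample
    prepared = []
    for sample in x:
        current = sample * abs(sample)                    # multiplier 1
        delayed = previous_sample * abs(previous_sample)  # multiplier 2
        prepared.append((current, delayed))
        previous_sample = sample
    return prepared


if __name__ == "__main__":
    samples = [1 + 1j, 2 - 1j, -0.5j]
    print(prepare_hardware_efficient(samples) == prepare_low_latency(samples))
```

Both functions yield the same pair of prepared signals for every sample. They differ only in whether the delay is placed after a shared multiplier or before two independent ones.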
Signal preparation becomes more complex for Volterra
structures as the number of coefficients grows rapidly with
regard to nonlinear order. Hardware efficient preparation of
second order signals in a Volterra structure with a memory










Fig. 4: 2nd order hardware efficient Volterra signal preparation
For multivariate signal preparation using the hardware
efficient orientation, signals of the order M are required to
obtain signals of the order M + 1. Each signal is multiplied
with $x_n$ and then shifted, if possible. This results in very
long critical paths for DPD systems of high nonlinear order.
The low latency orientation is performed identically for
multivariate coefficients as it is for univariate coefficients, as
seen in Fig. 3.
Polynomial evaluation schemes can simultaneously reduce
computational overhead and latency costs dramatically. This
is largely accomplished through indirectly applying prepared
signals to coefficients, which will be explored in the next
section.
III. POLYNOMIAL EVALUATION
Existing polynomial evaluation schemes include Horner’s,
Dorn’s and Estrin’s [7]. Optimisations are made by applying
low order signal terms multiple times to higher order
coefficients so that signal preparation costs can be mitigated.
Adders are also used in the application of coefficients instead
of multipliers wherever possible. In the following subsections,
the aforementioned methods and a novel proposed method
are explored individually and an implemented example of the
polynomial in (4) is provided for each.
A. Horner’s Method
Horner’s method [8] is a sequential polynomial evaluation
method that obtains the optimal solution in terms of hardware
requirements [9]. Due to its structure of sequential operations
this evaluation scheme can be adapted for multivariable and
multivariate polynomials. However, implementations using
Horner’s method tend to have longer critical paths compared to
other methods, resulting in high latency solutions. Polynomials
expressed using Horner’s method take the form as shown in
(9) [7].
$y = (((a_k \cdot x + a_{k-1}) \cdot x + a_{k-2}) \cdot x + \cdots + a_0)$. (9)










0)xn + h0)xn. (10)
A hardware implementation of (4) using Horner’s method










Fig. 5: Horner’s method
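The nested form in (9) can be illustrated with a minimal Python sketch. The function name and the coefficient ordering (highest order first, matching the left-to-right reading of (9)) are choices made for this example only.

```python
def horner(coeffs, x):
    """Evaluate a polynomial given coeffs = [a_k, a_{k-1}, ..., a_0]."""
    acc = 0
    for a in coeffs:
        # One multiply and one add per coefficient; each step needs the
        # result of the previous one, so the chain cannot be parallelised.
        acc = acc * x + a
    return acc


if __name__ == "__main__":
    # 2x^3 - 3x^2 + 5 evaluated at x = 3
    print(horner([2, -3, 0, 5], 3))
```

A degree-k polynomial uses k multiplications here, but they all lie on one dependency chain. This corresponds to the long critical path described above.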
B. Dorn’s Method
Dorn’s method [13] is a parallelised adaptation of Horner’s
method that offers improved speed performance at the cost of
additional hardware resources. Polynomials expressed using
Dorn’s method take the form as shown in (11) [7].
$y = q_0 + q_1 x + \cdots + q_{k-1} x^{k-1}$. (11)
Where,
$q_0 = a_0 + a_k x^k + a_{2k} x^{2k} + \cdots$,
$q_1 = a_1 + a_{k+1} x^k + \cdots$,
$\vdots$
$q_{k-1} = a_{k-1} + a_{2k-1} x^k + \cdots$. (12)



















A hardware implementation of (4) using Dorn’s method is







Fig. 6: Dorn’s method
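The splitting described by (11) and (12) can be sketched in Python as follows. Each branch q_j is evaluated by a sequential nested evaluation in x^k, and the k branches are independent of one another. The helper names and the lowest-order-first coefficient ordering are assumptions made for this example.

```python
def nested_eval(coeffs, x):
    """Sequential nested evaluation; coeffs[i] multiplies x**i."""
    acc = 0
    for a in reversed(coeffs):
        acc = acc * x + a
    return acc


def dorn(coeffs, x, k):
    """Evaluate a polynomial with k independent branches as in (11)-(12)."""
    x_to_k = x ** k
    # q_j = a_j + a_{j+k} x^k + a_{j+2k} x^{2k} + ...  (one branch per j)
    branches = [nested_eval(coeffs[j::k], x_to_k) for j in range(k)]
    # y = q_0 + q_1 x + ... + q_{k-1} x^{k-1}
    return nested_eval(branches, x)


if __name__ == "__main__":
    # 5 - 3x^2 + 2x^3 + x^4 + 4x^5 at x = 2, with two and three branches
    poly = [5, 0, -3, 2, 1, 4]
    print(dorn(poly, 2, 2), dorn(poly, 2, 3))
```

In hardware the branches would run concurrently, so each chain is roughly k times shorter than a single nested chain. The costs are computing x^k and combining the branch results.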
C. Estrin’s Method
Estrin’s method [14] is a parallel polynomial evaluation
scheme that focuses on preparing even-order signal terms
($x^2$, $x^4$, etc.). This process creates low latency, but hardware
inefficient, implementations. Polynomials expressed using
Estrin’s method take the form (14) [7].
y = qx(
n











2 + · · ·+ a0.
(15)
Polynomial (4) can be represented using this method as can













$\cdots + h_{0}x_{n}$. (16)
A hardware implementation of (4) using Estrin’s method is










Fig. 7: Estrin’s method
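A common textbook formulation of Estrin's scheme is sketched in Python here; it is not necessarily the exact form used in this paper. Neighbouring coefficients are combined pairwise using successively squared powers of x. The function name and the lowest-order-first coefficient ordering are assumptions made for this example.

```python
def estrin(coeffs, x):
    """Evaluate a polynomial pairwise; coeffs[i] multiplies x**i."""
    terms = list(coeffs)
    if not terms:
        return 0
    power = x
    while len(terms) > 1:
        if len(terms) % 2:
            terms.append(0)  # pad so that every term has a partner
        # Every pair (low, high) becomes low + high * power. The pairs do
        # not depend on each other, so one level maps to parallel hardware.
        terms = [terms[i] + terms[i + 1] * power
                 for i in range(0, len(terms), 2)]
        power = power * power  # x, x^2, x^4, ...
    return terms[0]


if __name__ == "__main__":
    # 5 - 3x^2 + 2x^3 + x^4 + 4x^5 at x = 2
    print(estrin([5, 0, -3, 2, 1, 4], 2))
```

The number of levels grows with the logarithm of the polynomial degree, which gives the short critical path. Every level needs its own multipliers plus the squared powers, which is the hardware cost noted above.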
D. Proposed Method
The proposed method, unlike the alternatives, factorises
polynomials such that even and odd order coefficients are
isolated from each other along parallel paths. In the event
that even order coefficients are pruned from a polynomial,
entire paths are removed from the implementation. As this is
a common operation in DPD functions, the proposed method
presents a combined hardware efficient and low latency
solution. Polynomials expressed using the proposed
method take the form (17).





$\cdots + (a_{k+2}x^{2} + a_{k})x^{k}$. (17)


















$\cdots + h_{0}x_{n}$. (18)
A hardware implementation of (4) using the proposed










Fig. 8: Proposed method
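The even/odd isolation can be illustrated with a Python sketch built on groups of the form (a_{k+2}x^2 + a_k)x^k, which is the only part of (17) available here. The choice of which k open a group (k = 0, 1, 4, 5, 8, 9, ...) is an assumption made so that every coefficient is used exactly once; it should not be read as the authors' exact factorisation. The function name is hypothetical.

```python
def parity_paths(coeffs, x):
    """Return (even_path, odd_path); coeffs[i] multiplies x**i."""
    x_squared = x * x
    n = len(coeffs)
    even_path = 0
    odd_path = 0
    for k in range(n):
        if k % 4 not in (0, 1):
            continue  # a_k was already used as the a_{k+2} of an earlier group
        a_high = coeffs[k + 2] if k + 2 < n else 0
        group = (a_high * x_squared + coeffs[k]) * x ** k
        # Both coefficients in a group share the parity of k, so even and
        # odd order terms never meet before the final addition.
        if k % 2 == 0:
            even_path += group
        else:
            odd_path += group
    return even_path, odd_path


if __name__ == "__main__":
    # Odd-only polynomial 3x + 2x^3 + x^5 at x = 2: the even path stays idle.
    print(parity_paths([0, 3, 0, 2, 0, 1], 2))
```

When the even order coefficients are pruned, the even path contributes nothing and could be left out of an implementation altogether. This is the saving described above.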
IV. FPGA IMPLEMENTATIONS
As optimisations of polynomial functions, the discussed
polynomial evaluation methods can be easily adapted to FPGA
implementations, and have been in previous publications [15].
The following graphs, Figs. 9, 10, 12 and 13, describe the
hardware requirements and critical path lengths of several
implementations of polynomials using the described evaluation
methods. These pipelined implementations comprised
16-bit complex multipliers, adders and shift registers.
Pipelining is performed to increase throughput, allowing
multiple instructions to be performed in each clock cycle,
and is implemented by inserting pipeline registers in each
distinct sub-stage of the datapaths [16].
Hardware requirements were measured as the number of
DSP48E1 slices used. Latency was gauged as critical path
length, measured as the number of pipeline
registers on the path. The DSP48E1 is a type of DSP
slice, a programmable arithmetic unit found in Xilinx FPGAs
that enables fast digital signal processing [17].
