Fast FPGA emulation of analog dynamics in digitally-driven systems by Herbst, Steven et al.
Fast FPGA Emulation of Analog Dynamics in Digitally-Driven Systems
Steven Herbst
Stanford University
sherbst@stanford.edu
Byong Chan Lim
Stanford University
bclim@stanford.edu
Mark Horowitz
Stanford University
horowitz@stanford.edu
ABSTRACT
In this paper, we propose an architecture for FPGA emulation of
mixed-signal systems that achieves high accuracy at a high through-
put. We represent the analog output of a block as a superposition
of step responses to changes in its analog input, and the output
is evaluated only when needed by the digital subsystem. Our ar-
chitecture is therefore intended for digitally-driven systems; that
is, those in which the inputs of analog dynamical blocks change
only on digital clock edges. We implemented a high-speed link
transceiver design using the proposed architecture on a Xilinx
FPGA. This design demonstrates how our approach breaks the link
between simulation rate and time resolution that is characteris-
tic of prior approaches. The emulator is flexible, allowing for the
real-time adjustment of analog dynamics, clock jitter, and various
design parameters. We demonstrate that our architecture achieves
1% accuracy while running 3 orders of magnitude faster than a
comparable high-performance CPU simulation.
CCS CONCEPTS
• Hardware→ Simulation and emulation;
KEYWORDS
FPGA emulation, mixed-signal circuits, verification
1 INTRODUCTION
Top-level simulation is a crucial part of the verification of today’s
complex chips. For entirely digital designs, FPGA emulation can
provide a significant performance boost; gains of 100,000x as com-
pared to CPU simulation have been reported [1]. However, for
systems containing mixed-signal components, as most SoCs do
today, emulating analog behavior poses a special challenge: not
only does one need to create functional models for analog blocks,
but those models must be written in a way that can be implemented
on an FPGA.
While there have been many approaches for functional modeling of
analog blocks in a digital validation environment, for example using
s-domain models [4, 5], piecewise-linear waveforms [7], and mixed-mode
simulation [6], these methods do not map easily onto an FPGA. Instead,
prior work in mixed-signal emulation has represented analog blocks
using oversampled discrete-time models [2, 3, 8, 11]. These models are
implemented as infinite impulse response (IIR) and finite impulse
response (FIR) filters, and once their values are quantized, the
resulting discrete-time, discrete-value digital filters can be directly
mapped onto an FPGA. While this approach enables emulation of analog
circuits, it unfortunately links simulation accuracy to the time step
used for the analog blocks. For systems that use high-speed links, fine
time resolution is required to model jitter, meaning that the emulator
time step must be much shorter than the shortest clock period in the
system, wasting resources and slowing down the system emulation.
ICCAD ’18, November 5–8, 2018, San Diego, CA, USA
© 2018 Copyright held by the owner/author(s). Publication rights licensed to the
Association for Computing Machinery.
ACM ISBN 978-1-4503-5950-4/18/11
https://doi.org/10.1145/3240765.3240808
To avoid these issues, this paper demonstrates an alternate ap-
proach that does not rely on oversampled models, providing accu-
rate emulation results while only using the existing clocks of the
system.
Our approach leverages the fact that most analog blocks in digital
systems have inputs that were originally created by another digital
block (e.g., a link transmitter, DAC, etc.) before being processed by
analog circuits. Thus, in addition to using an event-driven, variable
time step approach like [7], we further accelerate emulation by elim-
inating the need to create internal analog events. Instead, emulated
time progresses directly from one emulated clock edge to the next.
Analog outputs are computed as a superposition of step responses
to changes in analog inputs that are digitally-driven, meaning that
they change only on digital clock edges. We demonstrate that the
accuracy of our approach is independent of the time step.
The proposed architecture is presented in Section 2, followed by
an implementation example in Section 3 for an 8GT/s high-speed
link transceiver. Measured performance of our architecture on a
Xilinx FPGA is reported in Section 4, and Section 5 covers possible
extensions to handle nonlinearity and a broader class of inputs.
2 ARCHITECTURE
Figure 1 shows our emulation architecture, in which a digital sub-
system interacts with analog blocks through an analog dynamics
engine (ADE) and a time manager. The ADE transforms digitally-
driven analog inputs into analog outputs, while the time manager
determines the time associated with each emulator cycle and gen-
erates clock edges for the digital subsystem. The architecture is
intended to be implemented entirely by the programmable logic of
an FPGA, avoiding the need for an analog daughtercard.
arXiv:2002.02072v1 [cs.AR] 6 Feb 2020
[Block diagram omitted. Labels: Time Manager (Clock 1, Clock 2, ..., Clock N, each with emulated time, next edge time, output edge(s), and digital control; minimum; emulated clocks), Digital Subsystem, Analog Dynamics Engine, Analog Control, Clock Control.]
Figure 1: The proposed emulator architecture. A digital subsystem generates piecewise-constant analog inputs for an analog
dynamics engine (ADE) whose outputs are computed as superpositions of pulse responses. A time manager moves emulated
time forward from one emulated clock edge to the next and generates clocks for the digital subsystem.
[Diagram omitted. Panels: Piecewise-Constant Input x(t) with values x1, x2, x3 over edges t1 to t4; Impulse Response f(t); Output x(t) ∗ f(t); Summation of Pulse Responses x1 · (F(t − t1) − F(t − t2)) + x2 · (F(t − t2) − F(t − t3)) + x3 · (F(t − t3) − F(t − t4)).]
Figure 2: The response of a linear time-invariant (LTI) system to a piecewise-constant input can be computed as a summation
of pulse responses, each dependent on two values of the system’s step response F (t).
2.1 Analog Dynamics Engine
In a conventional mixed-signal emulator, each emulation cycle
corresponds to a fixed time step ∆t, with analog dynamics mapped
to hardware as FIR and/or IIR filters. However, the accuracy of
such an approach generally worsens as ∆t increases. For example,
if Euler’s method is used to generate the filter coefficients, the
global truncation error (GTE) is O(∆t), while the GTE associated
with the trapezoid rule is O(∆t²) [10]. As a result, there is a
trade-off between emulation rate and accuracy: high accuracy can be
achieved, but at a low throughput, and vice versa.
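The first-order scaling of the Euler error can be checked numerically. The sketch below is only an illustration: it assumes a hypothetical first-order low-pass block, dy/dt = (x − y)/τ, driven by a unit step, and the function name and parameter values are not taken from the emulator.

```python
import math

def euler_lowpass_error(dt, tau=1.0, t_end=1.0):
    """Worst-case error of forward Euler against the exact unit-step
    response of dy/dt = (x - y)/tau (hypothetical example block)."""
    n_steps = round(t_end / dt)
    y = 0.0
    worst = 0.0
    for k in range(1, n_steps + 1):
        y += dt * (1.0 - y) / tau                 # forward Euler update with x = 1
        exact = 1.0 - math.exp(-k * dt / tau)     # closed-form step response
        worst = max(worst, abs(y - exact))
    return worst

coarse = euler_lowpass_error(dt=0.02)
fine = euler_lowpass_error(dt=0.01)
print(coarse, fine, coarse / fine)  # halving dt roughly halves the error
```

Halving the time step roughly halves the error, so each additional bit of accuracy costs about a factor of two in emulation rate for this kind of fixed-step model.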
In our architecture, every emulation cycle corresponds to one or
more digital clock edges. Under the assumption that analog blocks
are digitally-driven, these are the exact times at which analog inputs
might change. The ADE therefore has a complete history of the
precise times and values of analog input steps, enabling each analog
output to be computed as a summation of step responses.
Our approach assumes that the analog blocks being modeled
are linear and time-invariant (LTI). Section 3 describes how this
approach can be simply extended to handle time-variant systems,
and Section 5 discusses how nonlinearity could be handled.
It is well known that an LTI system is characterized by an impulse
response f (t), with its output y (t) equal to the convolution of its
input x (t) with f (t):
\[ y(t) = f(t) * x(t) = \int_{-\infty}^{\infty} f(\tau) \, x(t - \tau) \, d\tau \tag{1} \]
When the input changes only on digital clock edges, the integral
can be reduced to a summation. Suppose that x (t) is a general
piecewise-constant function:
\[ x(t) = \sum_{k} x_k \left( u(t - t_k) - u(t - t_{k+1}) \right) \tag{2} \]
where u (t) is the unit step function. Substituting Equation 2 into
Equation 1 yields:
\[ y(t) = \sum_{k} x_k \int_{t - t_{k+1}}^{t - t_k} f(\tau) \, d\tau = \sum_{k} x_k \left( F(t - t_k) - F(t - t_{k+1}) \right) \tag{3} \]
where F (t) is the system’s step response:
\[ F(t) = \int_{0}^{t} f(\tau) \, d\tau \tag{4} \]
Equation 3 can be interpreted as a summation of pulse responses
weighted by the sequence of input values xk , as illustrated in Fig-
ure 2. Since each pulse response is simply the difference between
two values of F (t), the step response can be precomputed once and
subsequently used to calculate the system output for any piecewise-
constant input.
As long as the analog input is digitally-driven and the system is
LTI, Equation 3 is exact, meaning that the accuracy of our approach
does not depend on the size of the time steps taken by the emu-
lator. Notice that since our system makes no assumptions about
the width of input pulses, the effects of jitter on the clocks driving
analog blocks will automatically be included in the analog outputs.
In addition, there is no need to approximate analog dynamics by a
rational transfer function, as in [5], since our architecture makes
direct use of the system’s step response. This is particularly conve-
nient when working with measured frequency response data such
as the S-parameter model of a backplane channel [9].
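The exactness claim can be illustrated with a small numerical check. The sketch below assumes a hypothetical first-order system with step response F(t) = 1 − exp(−t/τ) and an arbitrary, unevenly spaced edge sequence (standing in for a jittered clock). It evaluates Equation 3 directly and compares the result with an exact interval-by-interval solution of the underlying differential equation. All names and values are illustrative.

```python
import math

TAU = 0.3  # time constant of the hypothetical first-order system (arbitrary units)

def step_response(t):
    """F(t) for dy/dt = (x - y)/TAU; zero for t <= 0 because the system is causal."""
    return 1.0 - math.exp(-t / TAU) if t > 0.0 else 0.0

def output_by_superposition(edge_times, values, t):
    """Equation 3: values[k] is held from edge_times[k] until edge_times[k+1]."""
    y = 0.0
    for k, x_k in enumerate(values):
        t_start = edge_times[k]
        t_stop = edge_times[k + 1] if k + 1 < len(edge_times) else math.inf
        y += x_k * (step_response(t - t_start) - step_response(t - t_stop))
    return y

def output_by_exact_stepping(edge_times, values, t):
    """Reference: solve the ODE exactly across each constant-input interval."""
    y = 0.0
    for k, x_k in enumerate(values):
        t_start = edge_times[k]
        if t_start >= t:
            break
        t_next = edge_times[k + 1] if k + 1 < len(edge_times) else math.inf
        t_stop = min(t_next, t)
        y = x_k + (y - x_k) * math.exp(-(t_stop - t_start) / TAU)
    return y

# Unevenly spaced edges mimic clock jitter; no oversampling grid is involved.
edges = [0.0, 0.11, 0.19, 0.33, 0.40]
vals = [1.0, -1.0, 1.0, 1.0, -1.0]
for t in (0.05, 0.19, 0.37, 0.9):
    print(t, output_by_superposition(edges, vals, t), output_by_exact_stepping(edges, vals, t))
```

The two columns agree to floating-point precision regardless of how the edges are spaced, which is the property that decouples accuracy from the emulator time step.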
2.2 Time Manager
The time manager has two tasks: 1) it assigns a time to each em-
ulation cycle, and 2) it generates clock edges for emulated blocks.
To maximize the emulation throughput, its objective is to take the
largest time step possible without skipping over any clock edges.
In our architecture, every emulated clock stores the time of its
next clock edge, and the minimum of these times is selected for the
next emulation cycle. Any clock whose edge is to occur at that time
will output the edge and update the time of its next edge.
Our time manager design ensures that at least one clock edge
is generated in every emulation cycle; there are no “analog-only”
cycles. As a result, the emulation rate of the entire mixed-signal
system is similar to that of the digital subsystem alone.
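A behavioral sketch of this scheduling rule follows. It assumes jitter-free clocks with integer periods (femtoseconds in the example) so that edge times can be compared exactly; the function name and the period values are hypothetical.

```python
def run_time_manager(periods, n_cycles):
    """Advance emulated time from edge to edge for clocks with integer periods.

    Returns a list of (time, fired) tuples, where fired holds the indices of
    the clocks that output an edge in that emulation cycle.
    """
    next_edge = list(periods)              # each clock stores the time of its next edge
    schedule = []
    for _ in range(n_cycles):
        now = min(next_edge)               # largest step that skips no clock edge
        fired = [i for i, t in enumerate(next_edge) if t == now]
        for i in fired:
            next_edge[i] += periods[i]     # the clock updates the time of its next edge
        schedule.append((now, fired))
    return schedule

# Hypothetical example: a 125 ps clock and a 62.5 ps clock, in femtoseconds.
schedule = run_time_manager(periods=[125_000, 62_500], n_cycles=6)
for now, fired in schedule:
    print(now, fired)
```

Every cycle in the resulting schedule contains at least one edge, and coincident edges from different clocks share a single emulation cycle.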
3 IMPLEMENTATION
We implemented our emulator architecture for an 8GT/s high-
speed link transceiver design. As shown in Figure 3, our emula-
tor has adjustable transmitter (TX) and receiver (RX) equalization,
with a clock and data recovery (CDR) loop that is closed through
a bang-bang phase detector (BBPD) and digitally-controlled oscil-
lator (DCO). A decision-feedback equalizer (DFE) helps to reduce
intersymbol interference (ISI).
3.1 Precomputation of Analog Dynamics
The ADE implements the combined analog dynamics of the channel
and the continuous-time linear equalizer (CTLE). Channel dynam-
ics were computed from S-parameter measurements [9], while the
CTLE transfer function was based on the PCIE specification, con-
sisting of two fixed poles and an adjustable zero to support adaptive
equalization.
In our implementation, the CTLE zero can be positioned in one
of 16 different settings between 0.4-2.0 GHz, meaning that the CTLE
is not time-invariant. To handle this, the ADE stores a family of
precomputed step responses, each representing the combined dy-
namics of the channel and CTLE in one of its settings (Figure 4).
During each emulation cycle, the ADE selects the appropriate step
response based on the current CTLE setting.
The dynamics of transitions between CTLE settings is not cap-
tured by this approach. However, it is often unnecessary to model
these transitions in detail, since they are typically much shorter
than the amount of time that the adaptive equalization algorithm
spends in a given CTLE setting.
3.2 Analog Dynamics Engine
As shown in Figure 5, the ADE is implemented as an array of taps,
each containing a step response lookup table. Following Equation 3,
the ADE output is a weighted sum of its input history, with each
weight computed as the difference in step response value between
two neighboring ADE taps.
Since the ADE has a finite number of taps, the implementation
effectively truncates Equation 3. For our system, 85 taps was suffi-
cient to limit the truncation error to a few tenths of a percent. As a
guideline, the number of taps should be approximately the ratio of
the step response settling time to the TX clock period, independent
of the emulation time resolution.
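The tap-count guideline and the effect of truncation can be sketched as follows. The example assumes a hypothetical first-order step response and a constant input history. The 10 ns and 125 ps figures are the approximate step response span and the unit interval of the 8 GT/s link described in this paper; everything else is illustrative.

```python
import math

def suggested_tap_count(settling_time_ps, tx_period_ps):
    """Guideline from the text: taps ~ step response settling time / TX clock period."""
    return math.ceil(settling_time_ps / tx_period_ps)

def truncated_output(values, period, n_taps, step_response):
    """Evaluate Equation 3 at the end of the last interval, keeping only the
    newest n_taps pulses (older input history is discarded)."""
    t = len(values) * period
    y = 0.0
    for k in range(max(0, len(values) - n_taps), len(values)):
        y += values[k] * (step_response(t - k * period)
                          - step_response(t - (k + 1) * period))
    return y

def first_order_step(t, tau=5.0):
    """Hypothetical step response that settles over a few tens of periods."""
    return 1.0 - math.exp(-t / tau) if t > 0.0 else 0.0

print(suggested_tap_count(settling_time_ps=10_000, tx_period_ps=125))

history = [1.0] * 100
full = truncated_output(history, 1.0, len(history), first_order_step)
err_5 = abs(full - truncated_output(history, 1.0, 5, first_order_step))
err_40 = abs(full - truncated_output(history, 1.0, 40, first_order_step))
print(err_5, err_40)
```

The guideline gives 80 taps for these numbers, close to the 85 used in the implementation, and the truncation error in the toy example falls quickly once the tap window covers the settling time.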
3.3 Compressing Step Response Data
To reduce the memory footprint of the ADE, each of its taps is
compressed by trimming its domain and by using piecewise-linear
approximations.
3.3.1 Domain Trimming. In general, each ADE tap must evalu-
ate the step response over a different timespan. For example, the
first tap in our ADE reads out step response values near t = 0 ns
while the last tap reads out values near t = 10 ns. More generally,
the kth ADE tap will evaluate the step response at a time between
(k−1) and k periods of the TX clock.
Taking advantage of this property, we trim the step response
lookup tables individually so that taps store only the data they may
actually need. In our emulator, the TX clock has a nominal period
T_TX and a period jitter that is uniformly distributed between −J_TX
and +J_TX, so its period is guaranteed to lie between T_TX − J_TX
and T_TX + J_TX. Hence, the domain of the kth lookup table can be
trimmed to the following interval:
\[ (k-1) \cdot (T_{TX} - J_{TX}) \le t - t_k \le k \cdot (T_{TX} + J_{TX}) \tag{5} \]
If the jitter distribution were unbounded, domain trimming
would require an approximate time range to be determined for
each tap. For example, suppose that the TX clock periods were modeled
as independent and identically distributed Gaussian random
variables with mean T_TX and standard deviation σ_TX. The sum of
n periods would then be a Gaussian random variable with mean
n·T_TX and standard deviation σ_TX·√n.
Hence, with high probability (1 − 2·10^−9), the duration of n
TX periods would lie within n·T_TX ± 6σ_TX·√n. It would therefore
likely be acceptable to trim the domain of the kth lookup table to the
following interval:
\[ (k-1) T_{TX} - 6 \sigma_{TX} \sqrt{k-1} \le t - t_k \le k T_{TX} + 6 \sigma_{TX} \sqrt{k} \tag{6} \]
[Block diagram omitted. Labeled blocks: PRBS, 3-tap TX FFE, analog_dynamics (Combined Analog Dynamics of Channel and RX CTLE), 2-tap RX DFE, Bang-Bang Phase Detector, Digital Loop Filter, RX DCO Transfer Function (pwl_table), time_manager. Adjustable settings: TX FFE Setting, RX CTLE Setting, Clock Jitter Characteristics, PI Loop Filter Parameters. Clock source: 200 MHz differential clock from ZC706 PCB.]
Figure 3: Block diagram of our FPGA emulator for a high-speed link transceiver. The digitally-driven channel_in signal is
the input to the analog dynamics engine (ADE), which implements the combined dynamics of the channel and an adjustable
continuous-time linear equalizer (CTLE). The time_manager module controls the flow of time in the emulator and determines
which clocks are active in each emulation cycle. The nonlinear transfer function of the digitally-controlled oscillator (DCO)
is implemented by a piecewise-linear lookup table. Circled settings are adjustable in real-time by users.
[Plot omitted: "Combined Channel and CTLE Step Responses"; x-axis: Time (ns), 0 to 10; y-axis: Value, 0.0 to 1.0.]
Figure 4: Family of step responses representing analog dynamics
in the high-speed link model. Each step response is the convolution
of a channel step response, computed from measured
S-parameters, with the impulse response of the RX CTLE in
one of 16 different settings. In our FPGA emulator, the CTLE
dynamics setting can be changed in real-time, without
reprogramming the FPGA.
In the exceptional case that an ADE tap must evaluate the step
response outside of this range, it could do so by extrapolation or
by reading from a neighboring tap.
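Equations 5 and 6 translate directly into small helper functions. The sketch below uses hypothetical function names and example values in picoseconds; the final line checks the two-sided 6σ tail probability quoted in the text using the standard complementary error function.

```python
import math

def trimmed_domain_bounded(k, t_tx, j_tx):
    """Equation 5: lookup table domain of tap k (1-indexed) when the period
    jitter is bounded within [-j_tx, +j_tx]."""
    return (k - 1) * (t_tx - j_tx), k * (t_tx + j_tx)

def trimmed_domain_gaussian(k, t_tx, sigma_tx, n_sigma=6.0):
    """Equation 6: approximate domain of tap k for i.i.d. Gaussian period jitter."""
    lo = (k - 1) * t_tx - n_sigma * sigma_tx * math.sqrt(k - 1)
    hi = k * t_tx + n_sigma * sigma_tx * math.sqrt(k)
    return lo, hi

# Hypothetical values in picoseconds: 125 ps nominal period, 5 ps bounded jitter.
print(trimmed_domain_bounded(k=1, t_tx=125, j_tx=5))
print(trimmed_domain_bounded(k=85, t_tx=125, j_tx=5))
print(trimmed_domain_gaussian(k=85, t_tx=125.0, sigma_tx=1.0))

# Probability that a Gaussian variable falls outside +/- 6 sigma (about 2e-9).
print(math.erfc(6.0 / math.sqrt(2.0)))
```

With bounded jitter the interval widens linearly with k, while in the Gaussian case it widens only with the square root of k, so later taps need proportionally less margin.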
3.3.2 Piecewise-Linear Approximation. We use piecewise-linear
(PWL) lookup tables to store step response data in a memory-
efficient manner. As illustrated in Figure 6, step responses were
approximated by a sequence of line segments whose offsets and
slopes were stored in lookup tables. For each ADE tap, the number
of PWL segments was determined by starting with two segments
and iteratively doubling the number of segments until the error
in the PWL representation was less than 0.1 %. Within each itera-
tion, linear programming was used to determine an optimal PWL
representation.
While not done in this current implementation, a multivariate
PWL representation could be used to further reduce the memory
overhead of an adjustable model by interpolating between its set-
tings instead of storing the step response of each one.
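The segment-doubling loop for a single (univariate) table can be sketched as below. This is a simplified stand-in, not the method used in the implementation: it places breakpoints uniformly and interpolates between segment endpoints instead of solving a linear program for the optimal fit, and it interprets the 0.1 % target as an absolute error of 10^−3 on a unit-amplitude step response. The step response itself is a hypothetical first-order example.

```python
import math

def build_pwl_table(func, t_min, t_max, n_segments):
    """Uniform PWL table: one offset (value at the segment start) and one slope
    per segment. Endpoint interpolation is used here for simplicity."""
    width = (t_max - t_min) / n_segments
    starts = [t_min + i * width for i in range(n_segments)]
    offsets = [func(t) for t in starts]
    slopes = [(func(t + width) - func(t)) / width for t in starts]
    return t_min, width, offsets, slopes

def eval_pwl(table, t):
    """Look up the segment containing t, then apply offset + slope * (t - segment start)."""
    t_min, width, offsets, slopes = table
    i = min(max(int((t - t_min) / width), 0), len(offsets) - 1)
    return offsets[i] + slopes[i] * (t - (t_min + i * width))

def fit_pwl(func, t_min, t_max, tol, n_check=2000):
    """Start with two segments and double the count until the sampled error is below tol."""
    n_segments = 2
    while True:
        table = build_pwl_table(func, t_min, t_max, n_segments)
        samples = (t_min + (t_max - t_min) * j / n_check for j in range(n_check + 1))
        err = max(abs(eval_pwl(table, t) - func(t)) for t in samples)
        if err < tol:
            return table, n_segments, err
        n_segments *= 2

def example_step(t):
    """Hypothetical step response used only for this illustration."""
    return 1.0 - math.exp(-t / 0.3)

table, n_segments, err = fit_pwl(example_step, 0.0, 2.0, tol=1e-3)
print(n_segments, err)
```

An optimized fit, as described in the text, would generally reach the same error target with fewer segments than this endpoint-interpolating version.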
3.4 Time Manager
There are two emulated clocks in our system, one for the transmitter
and one for the receiver. The RX clock has two output phases, since
the BBPD uses both edges of the RX clock, while the TX clock has
only one output phase to represent its rising edge.
Both clocks are implemented using instances of the module
shown in Figure 7. During each emulation cycle, the clock module
compares the emulation time, time_in, to the time of its next clock
edge, time_out. If the two match, it asserts one of the cke_out
signals and increments time_out. The increment includes jitter,
which is implemented by scaling the output of a linear feedback
shift register (LFSR).
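The behavior of the clock module can be sketched in software as follows. The class below is a behavioral illustration only: the 16-bit LFSR polynomial, the seed, the fixed-point scaling of the jitter, and the integer time units are all assumptions, and the names merely mirror the signals mentioned in the text.

```python
class ClockModule:
    """Behavioral sketch of one emulated clock with LFSR-based period jitter."""

    def __init__(self, period, jitter_scale, n_phases=1, seed=0xACE1):
        self.period = period              # nominal time between edges (integer units)
        self.jitter_scale = jitter_scale  # peak jitter magnitude (same units)
        self.n_phases = n_phases
        self.phase = 0                    # index of the cke_out bit asserted next
        self.lfsr = seed                  # any nonzero 16-bit value
        self.time_out = period            # time of the next clock edge

    def _next_lfsr(self):
        # 16-bit Fibonacci LFSR, polynomial x^16 + x^14 + x^13 + x^11 + 1 (assumed)
        bit = (self.lfsr ^ (self.lfsr >> 2) ^ (self.lfsr >> 3) ^ (self.lfsr >> 5)) & 1
        self.lfsr = (self.lfsr >> 1) | (bit << 15)
        return self.lfsr

    def step(self, time_in):
        """Return the cke_out bits for one emulation cycle."""
        cke_out = [False] * self.n_phases
        if time_in == self.time_out:
            cke_out[self.phase] = True
            self.phase = (self.phase + 1) % self.n_phases   # advance the clock phase
            signed = self._next_lfsr() - 0x8000              # about uniform in +/- 32767
            jitter = (signed * self.jitter_scale) >> 15      # scale to +/- jitter_scale
            self.time_out += self.period + jitter
        return cke_out

# Hypothetical two-phase clock, times in femtoseconds.
clk = ClockModule(period=125_000, jitter_scale=5_000, n_phases=2)
edge_times, phases = [], []
for _ in range(100):
    now = clk.time_out                    # a single clock, so its edge is always next
    out = clk.step(now)
    edge_times.append(now)
    phases.append(out.index(True))
print(edge_times[:4], phases[:4])
```

In a multi-clock emulator, `time_in` would be the minimum of all `time_out` values, so a given module asserts an enable only in the cycles where its own edge is due.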
The cke_out signals are used to generate clock signals for each
clock phase by gating a free-running 30MHz emulator clock on
a cycle-by-cycle basis. The gating itself is performed by a Xilinx
Mixed-Mode Clock Manager (MMCM) IP Block, which ensures
glitch-free gating and low inter-clock skew.
[Block diagram omitted: analog_dynamics module with Tap 1, Tap 2, ..., Tap n. Each tap contains a Step Response PWL Table (pwl_table, selected by setting), registers for time_hist and value_hist, and arithmetic producing step[k], pulse[k], and prod[k]; the products are accumulated through sum[k] to form out.]
Figure 5: Implementation of the analog dynamics engine (ADE). Each pulse response is computed as the difference between
two step response values, which depend on the timing of step changes in the analog input. The pulse responses are weighted
by values from the input history and summed to form the ADE output. Our emulator uses 85 taps to store an input history
spanning about 10 ns, with each tap corresponding to a unit interval (UI) of the link.
[Diagram omitted: a curve approximated by four line segments with breakpoints (τ1, a1), (τ2, a2), (τ3, a3), (τ4, a4) and slopes b1, b2, b3, b4.]
Figure 6: Our emulator uses piecewise-linear lookup tables
to store step responses and nonlinear functions in a compressed
form. Each time point τk in the lookup table is assigned
an offset ak and a slope bk.
4 RESULTS
In this section, we discuss measurements of emulator throughput,
resource utilization, and accuracy. The Xilinx ZC706 board, which
features a Xilinx Zynq-7045 FPGA, was used in these evaluations.
Compiling our high-speed link emulator for this platform took
11min from synthesis to bitstream generation using Vivado 2016.4.
4.1 Emulation Throughput
Table 1 summarizes the performance of our high-speed link emulator
in comparison to optimized CPU simulations of similar systems.
Since the emulator clock rate is 30 MHz, and three emulator clock
cycles are required to process each unit interval (UI), the emulator
throughput is 10^7 UI/s. At 8 GT/s, one UI is 125 ps, so this
corresponds to 1.25 ms of emulated time per second (1.25 ms/s).
[Block diagram omitted: clock_module with inputs time_in, period, jitter_scale, and clk; an LFSR; a time_eq comparator; a rotating mask[n-1:0]; outputs time_out and cke_out[n-1:0].]
Figure 7: Clock module used to model TX and RX clock behavior.
During each emulation cycle, time_in is compared to
time_out. If they match, time_out is incremented and one of
the cke_out signals is asserted. The clock phase is advanced
by rotating mask, and jitter is implemented by adding the
scaled output of an LFSR to the period input.
4.1.1 Performance Comparison. The CPU simulation rate of a
similar high-speed link written in SystemVerilog as a fast functional
model for validation was 0.741-1.429 µs/s [7]. In this model, an
event-driven approach was used, with analog dynamics represented
as a sum of linear filters and waveforms represented by piecewise-
linear segments. Our emulation system is at least 875x faster than
this comparable, high-performance CPU simulation.
A different fast CPU simulation approach based on s-domain
modeling has also been reported [5]. In that case, a high-speed link
was implemented using a 50-pole channel model, and a simulation
rate of 0.110-0.129 µs/s was achieved. In comparison, our emulation
throughput is at least 9,690x higher.
Table 1: Emulator performance in comparison to optimized
simulations of similar systems.
Design            [5]          [7]          This Work
Type              CPU          CPU          FPGA
Simulation Rate   0.129 µs/s   1.429 µs/s   1.250 ms/s
Speedup vs. [5]   1x           11x          9,690x
Table 2: Resource utilization of the entire emulator.
Resource   Count    Available   Percent Utilization
LUT        11,392   218,600     5.2 %
FF         8,076    437,200     1.8 %
BRAM       93       545         17.1 %
DSP        141      900         15.7 %
Table 3: Resource utilization of the analog dynamics engine
(ADE).
Resource   Count    Available   Percent Utilization
LUT        8,085    218,600     3.7 %
FF         3,475    437,200     0.8 %
BRAM       36       545         6.6 %
DSP        138      900         15.3 %
In order to isolate the speedup in running our emulator on an
FPGA, we measured the performance of a CPU-based simulation of
the emulator’s SystemVerilog code. This simulation was run using
Cadence Xcelium 18.03 on an Intel Xeon E5645 CPU (2.4 GHz) with
96GB RAM; we measured a simulation rate of 0.192 µs/s. Hence,
our architecture runs 6,510x faster on an FPGA than on a multi-core
CPU-based simulator.
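The throughput and speedup figures quoted in this section follow from a few lines of arithmetic, reproduced here as a check. The variable names are illustrative; the numbers are the ones reported above.

```python
emulator_clock_hz = 30e6   # free-running emulator clock
cycles_per_ui = 3          # emulator clock cycles needed per unit interval
ui_seconds = 1 / 8e9       # 8 GT/s link, so one UI lasts 125 ps

ui_per_second = emulator_clock_hz / cycles_per_ui
emulated_seconds_per_second = ui_per_second * ui_seconds

# Best reported CPU simulation rates, in emulated seconds per wall-clock second.
cpu_rates = {
    "[7]": 1.429e-6,
    "[5]": 0.129e-6,
    "emulator RTL on CPU": 0.192e-6,
}
speedups = {name: emulated_seconds_per_second / rate for name, rate in cpu_rates.items()}

print(ui_per_second, emulated_seconds_per_second)
for name, speedup in speedups.items():
    print(name, round(speedup))
```

Because each comparison uses the fastest rate reported for the CPU approach, the resulting ratios are lower bounds on the speedup.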
4.2 Resource Utilization
Table 2 shows the FPGA resource utilization of the entire emulation
system, demonstrating that no more than 17.1 % of any resource
was needed. As a result, there would be ample room left over to
emulate a larger digital subsystem. For example, one could build a
more complete multi-lane SERDES system including equalization
adaptation logic and PCS (Physical Coding Sublayer).
Table 3 summarizes the resource utilization of the ADE alone.
Note that only 39 % of block RAM (BRAM) in the emulator is used
by the ADE; the remaining BRAM tiles are consumed by Integrated
Logic Analyzer (ILA) IP blocks to capture internal waveforms.
4.2.1 LUTs and FFs. To put the LUT and FF utilization in perspective,
one Xilinx MicroBlaze soft processor consumes 2,071 LUTs
and 1,672 FFs with typical settings on our FPGA [12]. Hence, our
emulator has a LUT and FF footprint equivalent to about five or six
MicroBlaze cores. Approximately 100 more such cores could fit in
the resources remaining after instantiating our emulator.
[Plot omitted: "Bits Requirement for Step Response Storage"; x-axis: Tap #, 0 to 80; y-axis: Required ROM Size (kb), 0 to 60; dashed lines at 0.5 BRAM, 1.0 BRAM, and 1.5 BRAM.]
Figure 8: Number of bits required to store step response data
for each tap in the analog dynamics engine (ADE). Memory
on the Xilinx Zynq-7045 FPGA used in our implementation
can be allocated in increments of a half BRAM tile (18 kb)
as indicated by the dashed lines. Only 8 % of taps required
more than the minimum increment.
4.2.2 BRAM. Our individual optimization of the ADE taps re-
duced the BRAM tile requirement by more than 22.5x, from 2,097
(which would not have fit on our FPGA) to 93. As a result, the
largest consumers of BRAM in our emulator are the ILA blocks,
rather than the PWL tables in the ADE.
Figure 8 shows the optimized lookup table size for each ADE tap.
More memory is required to represent parts of the step response
that are rapidly varying, such as the area around t = 4 ns in Figure 4.
However, the majority of tables (92 %) require no more than a half
BRAM tile, which is the minimum unit that can be allocated on our
FPGA.
4.2.3 DSP. Most (97.9 %) of the DSP utilization is attributed to
the ADE. Each of its 85 taps has two multipliers: one to implement
its PWL table, and one to weight its input value by the correspond-
ing pulse response. Since each DSP slice in our FPGA contains
a single multiplier, the expected number of DSP slices consumed
by the ADE is therefore around 170. The actual DSP utilization is
18.3 % lower, since some low-precision multiplications were
synthesized to LUTs.
4.3 Emulation Accuracy
The accuracy of our emulator is evaluated in two ways. First, the
waveforms of our FPGA emulation are directly compared with those
of a CPU simulation. Second, we compare high-level behavioral
metrics between the two approaches.
Compared to the CPU simulation, there are several potential
sources of error in the emulation which include: 1) quantization
error due to a fixed-point representation, 2) PWL approximation
error in representing step responses, and 3) truncation error due to
the finite number of pulses used.
[Plot omitted: "Transient Accuracy"; x-axis: Time (ns), 11 to 16; y-axis: Value; series: CPU, FPGA.]
Figure 9: Comparison of FPGA emulation and CPU simulation
waveforms, shown at the output of the RX CTLE.
Our emulator evaluates analog signals only on digital clock
edges, and these samples are shown as dots. The continuous-time
waveform was calculated using CPU simulation and is
shown as a solid curve. Over all 160 combinations of TX and
RX equalizer settings, the worst-case relative error observed
was -0.7%/+1.1%.
4.3.1 Transient Accuracy. The ADE output was captured by the
ILA and compared with the corresponding output from a CPU
simulation. The waveform comparison was conducted over all 160
different configurations of TX and RX equalization settings. In each
configuration, emulated and simulated waveforms were compared
over a 128 ns duration (1024 UIs). The worst-case relative error (see
footnote 1) observed was -0.7/+1.1%. Figure 9 shows a sample comparison of
the transient waveforms.
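The relative error metric, normalized to the peak of the CPU waveform, can be written as a short function. The signed variant below is an assumption about how the separate negative and positive bounds were obtained; the function names and sample data are illustrative and are not measurements from the emulator.

```python
def relative_error(y_fpga, y_cpu):
    """max|y_FPGA - y_CPU| / max|y_CPU| over time-aligned samples."""
    full_scale = max(abs(v) for v in y_cpu)
    return max(abs(a - b) for a, b in zip(y_fpga, y_cpu)) / full_scale

def signed_relative_error_bounds(y_fpga, y_cpu):
    """Most negative and most positive deviation, normalized the same way
    (assumed interpretation of a -x%/+y% style result)."""
    full_scale = max(abs(v) for v in y_cpu)
    deviations = [(a - b) / full_scale for a, b in zip(y_fpga, y_cpu)]
    return min(deviations), max(deviations)

# Made-up samples purely to exercise the functions.
cpu = [0.50, -0.50, 0.25, -0.10]
fpga = [0.50, -0.49, 0.248, -0.10]
print(relative_error(fpga, cpu))
print(signed_relative_error_bounds(fpga, cpu))
```

Normalizing by the peak of the reference waveform, and not sample by sample, keeps the metric well defined near zero crossings.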
4.3.2 Behavioral Accuracy. Figure 10 shows a comparison of
the CDR startup waveforms from our emulator and from a CPU
simulation. They are in generally good agreement, with 10 % set-
tling times of 513 ns and 524 ns for the emulation and simulation,
respectively. Figure 11 compares amplitude histograms at the DFE
output. The standard deviation about “0” and “1” levels was 39.2mV
for the emulation and 37.4mV for the simulation. Note that the
CPU simulations used in these two comparisons did not include
clock jitter.
Footnote 1: Relative error is defined as max |y_FPGA − y_CPU| / max |y_CPU|, where y_FPGA and y_CPU
are the FPGA and CPU waveforms, respectively.
[Plots omitted: two panels, "FPGA Emulation" and "CPU Simulation"; x-axis: Time (ns), 0 to 1000; y-axis: DCO Code, 2000 to 8000.]
Figure 10: Startup transient of the high-speed link
transceiver captured by the FPGA emulator. The RX
DCO code is shown over the first 1 µs of operation, starting
from an initial value of 1000 (7.6 GHz) and settling around
8192 (8.0 GHz).
[Plot omitted: "DFE Output Histogram"; x-axis: Voltage; y-axis: Probability Density, 0 to 5; series: CPU, FPGA.]
Figure 11: Amplitude histogram at the DFE output, constructed
from emulator data. The dashed curve represents
the distribution of comparable data gathered in a CPU simulation.
5 EXTENSIONS
In this section, we describe possible extensions to model nonlinearity
and handle a broader class of analog input signals.
5.1 Handling a Broader Class of Inputs
Up until this point, we have used the term digitally-driven to refer to
analog input signals that change only on digital clock edges, meaning
that they are piecewise-constant. However, it is possible
to broaden this definition to include piecewise-polynomial waveforms,
which can at least approximately represent an arbitrary
analog signal.
Suppose that an analog signal is comprised of polynomial seg-
ments of degree n:
\[ x(t) = \sum_{i} \sum_{j=0}^{n} x_{ij} \cdot (t - t_i)^j \cdot \left( u(t - t_i) - u(t - t_{i+1}) \right) \tag{7} \]
Assuming that this signal is supplied as input to a system with
an impulse response f (t), the resulting output will be:
\[ y(t) = \sum_{i} \sum_{j=0}^{n} x_{ij} \int_{t - t_{i+1}}^{t - t_i} f(\tau) \cdot (t - t_i - \tau)^j \, d\tau \tag{8} \]
Applying the Binomial Theorem yields:
\[ \begin{aligned}
y(t) &= \sum_{i} \sum_{j=0}^{n} x_{ij} \int_{t - t_{i+1}}^{t - t_i} f(\tau) \sum_{k=0}^{j} \binom{j}{k} (t - t_i)^{j-k} (-\tau)^k \, d\tau \\
     &= \sum_{i} \sum_{k=0}^{n} c_{ik}(t) \cdot \left( F_k(t - t_i) - F_k(t - t_{i+1}) \right)
\end{aligned} \tag{9} \]
where
\[ \begin{aligned}
c_{ik}(t) &= \sum_{j=k}^{n} x_{ij} \cdot \binom{j}{k} \cdot (t - t_i)^{j-k} \\
F_k(t) &= \int_{0}^{t} f(\tau) \cdot (-\tau)^k \, d\tau
\end{aligned} \tag{10} \]
The n + 1 step response-like functions F_k (k = 0, ..., n) could be
precomputed for use during emulation and, as before, these functions
could be implemented using PWL tables.
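As a sanity check of Equation 9, the sketch below evaluates it for a single linear (n = 1) input segment applied to a hypothetical first-order system with impulse response f(t) = exp(−t/τ)/τ. For this system both F_0 and F_1 have closed forms, and the result can be compared with the direct solution of the differential equation. All names and parameter values are illustrative.

```python
import math

TAU = 0.3  # hypothetical first-order system: f(t) = exp(-t/TAU)/TAU for t >= 0

def F0(t):
    """F_0(t): integral of f(s) from 0 to t (the ordinary step response)."""
    return 1.0 - math.exp(-t / TAU) if t > 0.0 else 0.0

def F1(t):
    """F_1(t): integral of f(s) * (-s) from 0 to t."""
    return -(TAU - (t + TAU) * math.exp(-t / TAU)) if t > 0.0 else 0.0

def ramp_segment_response(a, b, t_start, t_stop, t):
    """Equation 9 with n = 1 for one segment x(t) = a + b*(t - t_start) on [t_start, t_stop)."""
    c0 = a + b * (t - t_start)   # c_i0(t) = x_i0 + x_i1 * (t - t_i)
    c1 = b                       # c_i1(t) = x_i1
    return (c0 * (F0(t - t_start) - F0(t - t_stop))
            + c1 * (F1(t - t_start) - F1(t - t_stop)))

def ramp_segment_reference(a, b, t_start, t_stop, t):
    """Closed-form solution of dy/dt = (x - y)/TAU with y = 0 before the segment."""
    def during(dt):
        decay = math.exp(-dt / TAU)
        return a * (1.0 - decay) + b * (dt - TAU * (1.0 - decay))
    if t <= t_start:
        return 0.0
    if t <= t_stop:
        return during(t - t_start)
    return during(t_stop - t_start) * math.exp(-(t - t_stop) / TAU)

for t in (0.05, 0.2, 0.3, 0.5, 1.5):
    print(t, ramp_segment_response(0.5, 2.0, 0.1, 0.3, t),
          ramp_segment_reference(0.5, 2.0, 0.1, 0.3, t))
```

A full waveform would sum this expression over all segments i, in the same way that Equation 3 sums over pulses.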
5.2 Modeling Nonlinearity
Memoryless nonlinearities occurring at the input or output of an
analog block are straightforward to model in our architecture, since
they can be implemented by lookup tables outside of the ADE.
For example, in our high-speed link implementation, the transfer
function from the DCO code n to the RX clock period T_RX is given
by the nonlinear relation T_RX = 1/(α + βn). We used a small (160
bit) PWL lookup table to implement this behavior.
Our digitally-driven approach could also be used to model a block
governed by the first-order nonlinear dynamics ẏ = g(x, y), where
x is its piecewise-constant input and y is its output. Assuming that
x is constant from t1 to t2, the value of y at the end of that time
interval can be written as a function of the interval length and the
values of x and y at the beginning of the interval:
\[ y(t_2) = G(t_2 - t_1, x(t_1), y(t_1)) \tag{11} \]
In principle, the function G could be precomputed for use during
emulation. While this approach might not always be practical, there
are at least two cases that lend themselves to efficient implementation.
First, if the system’s differential equation can be solved analytically
for a constant input, then G may have a convenient closed-form
expression. For example, an integrator that saturates to ±1
could be represented by:
\[ y(t_2) = \min\left( \max\left( y(t_1) + (t_2 - t_1) \cdot x(t_1), -1 \right), 1 \right) \tag{12} \]
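Equation 12 maps to a one-line update that is applied once per input interval. The sketch below uses hypothetical names and an arbitrary input sequence to show how the output would be advanced from edge to edge.

```python
def saturating_integrator_step(y1, x1, dt):
    """Equation 12: output after holding input x1 for a duration dt, clipped to [-1, 1]."""
    return min(max(y1 + dt * x1, -1.0), 1.0)

# Advance the output across a sequence of (input value, interval length) pairs.
y = 0.0
trace = []
for x, dt in [(1.0, 0.5), (1.0, 2.0), (-1.0, 0.25), (-1.0, 4.0)]:
    y = saturating_integrator_step(y, x, dt)
    trace.append(y)
print(trace)
```

Because the input is constant within each interval, a single evaluation per clock edge gives the exact output, however long the interval is.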
Second, if the input is restricted to certain discrete values, as in
a high-speed link, then it may be possible to represent G using a
small number of precomputed trajectories.
As an example, consider a filter governed by the nonlinear dynamics
ẏ = (x − y)/τ(y), with τ positive. Assuming that the input
is limited to values of −1 and 1, there are effectively two unique out-
put trajectories: one starting at 1 and decreasing towards −1 (with
input −1), and the other starting at −1 and increasing towards 1
(with input 1). These two trajectories could be precomputed for use
during emulation, so that whenever the input value changes, the tra-
jectory corresponding to the new input value could be played back
starting from the output value at the time of the input transition.
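The playback idea can be sketched as follows. The form of tau(y), the value of TAU0, the table length, and the use of a fourth-order Runge-Kutta integrator for the precomputation are all assumptions made for illustration. The two trajectories are stored on a uniform time grid. At each input transition, the current output value is located on the trajectory for the new input, and the output at the next evaluation time is read by advancing along that trajectory.

```python
import numpy as np

TAU0 = 1.0e-9  # assumed nominal time constant, in seconds

def tau(y):
    """Hypothetical output-dependent time constant (always positive)."""
    return TAU0 * (1.0 + 0.5 * y * y)

def precompute_trajectory(x, y_start, t_end, n_points):
    """Integrate dy/dt = (x - y) / tau(y) offline with fixed-step RK4."""
    t = np.linspace(0.0, t_end, n_points)
    y = np.empty(n_points)
    y[0] = y_start
    h = t[1] - t[0]
    f = lambda v: (x - v) / tau(v)
    for i in range(n_points - 1):
        k1 = f(y[i])
        k2 = f(y[i] + 0.5 * h * k1)
        k3 = f(y[i] + 0.5 * h * k2)
        k4 = f(y[i] + h * k3)
        y[i + 1] = y[i] + h / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return t, y

T_END = 20.0 * TAU0   # long enough for each trajectory to settle
N_POINTS = 2001

# One stored trajectory per allowed input value.
TRAJ = {
    +1: precompute_trajectory(+1.0, -1.0, T_END, N_POINTS),  # rising
    -1: precompute_trajectory(-1.0, +1.0, T_END, N_POINTS),  # falling
}

def playback(dt, x, y1):
    """Evaluate G(dt, x, y1) by playing back the trajectory for input x."""
    t, y = TRAJ[x]
    if x > 0:
        t0 = np.interp(y1, y, t)              # y is increasing in t
    else:
        t0 = np.interp(y1, y[::-1], t[::-1])  # reverse so y is increasing
    return float(np.interp(t0 + dt, t, y))

# Input is +1 for 0.7 ns, then switches to -1 for 0.4 ns.
ya = playback(0.7e-9, +1, -1.0)
yb = playback(0.4e-9, -1, ya)
print(ya, yb)
```

In this sketch the accuracy is set by the table resolution and the interpolation scheme, not by the spacing between evaluation times.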
6 CONCLUSION
In this paper, we described an FPGA architecture for emulating a
mixed-signal system with a digitally-driven analog block; that is,
one whose input changes only on digital clock edges. The analog
output of such a block is computed as a weighted sum of pulse
responses, and is calculated in a way that allows the emulator to
progress directly from one digital clock edge to the next. Unlike
a conventional oversampled approach, the emulator’s accuracy is
independent of time step size.
Using an 8 GT/s high-speed link transceiver as an example, we
implemented the proposed architecture on a Xilinx Zynq-7045
FPGA. The emulation rate achieved was 1.250 ms/s, which represents
an 875x improvement over a high-performance CPU simulation
of a similar system. The worst-case error observed in comparison
to an idealized CPU computation of the analog dynamics
was -0.7%/+1.1%.
We conclude that the proposed architecture is appropriate for ver-
ifying the behavior of mixed-signal systems over long time scales,
where CPU simulation would be impractically time-consuming.
Owing to its low resource utilization and ability to handle multiple
clock domains, we expect that it will scale well to large designs.
ACKNOWLEDGMENT
This work is supported by National Science Foundation Grant No.
1509126, a Hertz Foundation Fellowship, and a Stanford Graduate
Fellowship.
REFERENCES
[1] Sameh Asaad, Ralph Bellofatto, Bernard Brezzo, Chuck Haymes, Mohit Kapur,
Benjamin Parker, Thomas Roewer, Proshanta Saha, Todd Takken, and José Tierno.
2012. A Cycle-accurate, Cycle-reproducible multi-FPGA System for Accelerating
Multi-core Processor Simulation. In Proceedings of the ACM/SIGDA International
Symposium on Field Programmable Gate Arrays. ACM, 153–162.
[2] Timon Bruckner, Matthias Lorenz, Christoph Zorn, Joachim Becker, Wolfgang
Mathis, and Maurits Ortmanns. 2012. Hardware-Accelerated Simulation Envi-
ronment for CT Sigma-Delta Modulators Using an FPGA. IEEE Transactions on
Circuits and Systems II: Express Briefs 59, 8 (July 2012), 471–475.
[3] B. S. Deepaksubramanyan, P. Parakh, Zhenhua Chen, H. Diab, D. Marcy, and
F. H. Schlereth. 2005. An FPGA-based MOS circuit simulator. In 48th Midwest
Symposium on Circuits and Systems. 655–658.
[4] J. E. Jang, M. Park, and J. Kim. 2013. An event-driven simulation methodology for
integrated switching power supplies in SystemVerilog. In 50th ACM/EDAC/IEEE
Design Automation Conference (DAC).
[5] J. E. Jang, M. J. Park, D. Lee, and J. Kim. 2012. True event-driven simulation of
analog/mixed-signal behaviors in SystemVerilog: A decision-feedback equalizing
(DFE) receiver example. In Proceedings of the IEEE 2012 Custom Integrated Circuits
Conference.
[6] Ken Kundert and Olaf Zinke. 2013. The Designer’s Guide to Verilog-AMS. Springer
Publishing Company, Incorporated.
[7] B. C. Lim and M. Horowitz. 2016. Error Control and Limit Cycle Elimination in
Event-Driven Piecewise Linear Analog Functional Models. IEEE Transactions on
Circuits and Systems I: Regular Papers 63, 1 (Jan. 2016), 23–33.
[8] Frank Austin Nothaft, Luis Fernandez, Stephen Cefali, Nishant Shah, Jacob Rael,
and Luke Darnell. 2014. Pragma-based Floating-to-fixed Point Conversion for
the Emulation of Analog Behavioral Models. In Proceedings of the 2014 IEEE/ACM
International Conference on Computer-Aided Design. 633–640.
[9] Sam Palermo. 2017. 12" Backplane S-Parameter Data. (2017).
http://www.ece.tamu.edu/~spalermo/ecen689/peters_01_0605_B12_thru.s4p
[10] E. Süli and D.F. Mayers. 2003. An Introduction to Numerical Analysis. Cambridge
University Press.
[11] Philipp Tertel and Lars Hedrich. 2017. Real-time emulation of block-based analog
circuits on an FPGA. In 14th International Conference on Synthesis, Modeling,
Analysis and Simulation Methods and Applications to Circuit Design (SMACD).
[12] Xilinx. 2018. MicroBlaze Processor Reference Guide. (April 2018).
https://www.xilinx.com/support/documentation/sw_manuals/xilinx2018_1/ug984-vivado-microblaze-ref.pdf
