In this paper, we propose an architecture for FPGA emulation of mixed-signal systems that achieves high accuracy at a high throughput. We represent the analog output of a block as a superposition of step responses to changes in its analog input, and the output is evaluated only when needed by the digital subsystem. Our architecture is therefore intended for digitally-driven systems; that is, those in which the inputs of analog dynamical blocks change only on digital clock edges. We implemented a high-speed link transceiver design using the proposed architecture on a Xilinx FPGA. This design demonstrates how our approach breaks the link between simulation rate and time resolution that is characteristic of prior approaches. The emulator is flexible, allowing for the real-time adjustment of analog dynamics, clock jitter, and various design parameters. We demonstrate that our architecture achieves 1% accuracy while running 3 orders of magnitude faster than a comparable high-performance CPU simulation.
INTRODUCTION
Top-level simulation is a crucial part of the verification of today's complex chips. For entirely digital designs, FPGA emulation can provide a significant performance boost; gains of 100,000x as compared to CPU simulation have been reported [1]. However, for systems containing mixed-signal components, as most SoCs do today, emulating analog behavior poses a special challenge: not only does one need to create functional models for analog blocks, but those models must be written in a way that can be implemented on an FPGA.
While there have been many approaches for functional modeling of analog blocks in a digital validation environment, for example using s-domain models [4, 5], piecewise-linear waveforms [7], and mixed-mode simulation [6], these methods do not map easily onto an FPGA. Instead, prior work in mixed-signal emulation has represented analog blocks using oversampled discrete-time models [2, 3, 8, 11]. These models are implemented as infinite impulse response (IIR) and finite impulse response (FIR) filters, and once their values are quantized, the resulting discrete-time, discrete-value digital filters can be directly mapped onto an FPGA. While this approach enables emulation of analog circuits, it unfortunately links simulation accuracy to the time step used for the analog blocks. For systems that use high-speed links, fine time resolution is required to model jitter, meaning that the emulator time step must be much shorter than the shortest clock period in the system, wasting resources and slowing down the system emulation.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. ICCAD '18, November 5-8, 2018, San Diego, CA, USA. © 2018 Copyright held by the owner/author(s). Publication rights licensed to the Association for Computing Machinery. ACM ISBN 978-1-4503-5950-4/18/11...$15.00 https://doi.org/10.1145/3240765.3240808
To avoid these issues, this paper demonstrates an alternate approach that does not rely on oversampled models, providing accurate emulation results while only using the existing clocks of the system.
Our approach leverages the fact that most analog blocks in digital systems have inputs that were originally created by another digital block (e.g., a link transmitter, DAC, etc.) before being processed by analog circuits. Thus, in addition to using an event-driven, variable time step approach like [7], we further accelerate emulation by eliminating the need to create internal analog events. Instead, emulated time progresses directly from one emulated clock edge to the next. Analog outputs are computed as a superposition of step responses to changes in analog inputs that are digitally-driven, meaning that they change only on digital clock edges. We demonstrate that the accuracy of our approach is independent of the time step.
The proposed architecture is presented in Section 2, followed by an implementation example in Section 3 for an 8 GT/s high-speed link transceiver. Measured performance of our architecture on a Xilinx FPGA is reported in Section 4, and Section 5 covers possible extensions to handle nonlinearity and a broader class of inputs.
ARCHITECTURE
Figure 1 shows our emulation architecture, in which a digital subsystem interacts with analog blocks through an analog dynamics engine (ADE) and a time manager. The ADE transforms digitally-driven analog inputs into analog outputs, while the time manager determines the time associated with each emulator cycle and generates clock edges for the digital subsystem. The architecture is intended to be implemented entirely by the programmable logic of an FPGA, avoiding the need for an analog daughtercard.
Analog Dynamics Engine
In a conventional mixed-signal emulator, each emulation cycle corresponds to a fixed time step ∆t, with analog dynamics mapped to hardware as FIR and/or IIR filters. However, the accuracy of such an approach generally worsens as ∆t increases. For example, if Euler's method is used to generate the filter coefficients, the global truncation error (GTE) is O(∆t), while the GTE associated with the trapezoid rule is O(∆t²) [10]. As a result, there is a tradeoff between emulation rate and accuracy: high accuracy can be achieved, but at a low throughput, and vice versa.
In our architecture, every emulation cycle corresponds to one or more digital clock edges. Under the assumption that analog blocks are digitally-driven, these are the exact times at which analog inputs might change. The ADE therefore has a complete history of the precise times and values of analog input steps, enabling each analog output to be computed as a summation of step responses.
Figure 1: The proposed emulator architecture. A digital subsystem generates piecewise-constant analog inputs for an analog dynamics engine (ADE) whose outputs are computed as superpositions of pulse responses. A time manager moves emulated time forward from one emulated clock edge to the next and generates clocks for the digital subsystem.
Summation of Pulse Responses
Our approach assumes that the analog blocks being modeled are linear and time-invariant (LTI). Section 3 describes how this approach can be simply extended to handle time-variant systems, and Section 5 discusses how nonlinearity could be handled.
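It is well known that an LTI system is characterized by an impulse response f(t), with its output y(t) equal to the convolution of its input x(t) with f(t):
$$y(t) = \int_{-\infty}^{\infty} f(\tau)\, x(t - \tau)\, d\tau \tag{1}$$
When the input changes only on digital clock edges, the integral can be reduced to a summation. Suppose that x(t) is a general piecewise-constant function:
$$x(t) = \sum_k x_k \left[ u(t - t_k) - u(t - t_{k+1}) \right] \tag{2}$$
where u(t) is the unit step function and $x_k$ is the input value held between consecutive clock edges $t_k$ and $t_{k+1}$. Substituting Equation 2 into Equation 1 yields:
$$y(t) = \sum_k x_k \left[ F(t - t_k) - F(t - t_{k+1}) \right] \tag{3}$$
where F(t) is the system's step response:
$$F(t) = \int_{-\infty}^{t} f(\tau)\, d\tau \tag{4}$$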
Equation 3 can be interpreted as a summation of pulse responses weighted by the sequence of input values $x_k$, as illustrated in Figure 2. Since each pulse response is simply the difference between two values of F(t), the step response can be precomputed once and subsequently used to calculate the system output for any piecewise-constant input.
As long as the analog input is digitally-driven and the system is LTI, Equation 3 is exact, meaning that the accuracy of our approach does not depend on the size of the time steps taken by the emulator. Notice that since our system makes no assumptions about the width of input pulses, the effects of jitter on the clocks driving analog blocks will automatically be included in the analog outputs. In addition, there is no need to approximate analog dynamics by a rational transfer function, as in [5], since our architecture makes direct use of the system's step response. This is particularly convenient when working with measured frequency response data such as the S-parameter model of a backplane channel [9].
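As a minimal sketch of Equation 3, the following Python evaluates an output from a list of input edges by summing pulse responses. The first-order low-pass step response, the time constant `tau`, and the function names are illustrative assumptions and are not taken from the emulator itself.

```python
import math

def step_response(t, tau=1.0):
    """Step response F(t) of a first-order low-pass filter (illustrative choice)."""
    return 1.0 - math.exp(-t / tau) if t > 0 else 0.0

def output_at(t, edge_times, values, F=step_response):
    """Evaluate y(t) as a sum of pulse responses.

    edge_times[k] is the time at which the input steps to values[k];
    the last value is held indefinitely. The input is zero before the first edge.
    """
    y = 0.0
    for k, (t_k, x_k) in enumerate(zip(edge_times, values)):
        if t_k >= t:
            break  # input steps at or after t do not affect y(t)
        # End of the k-th pulse: the next edge, or "never" for the last value
        t_next = edge_times[k + 1] if k + 1 < len(edge_times) else math.inf
        # Each pulse response is the difference of two step-response values
        y += x_k * (F(t - t_k) - F(t - t_next))
    return y

# Example: irregularly spaced (jittered) edges need no special handling
edges = [0.0, 0.9, 2.3, 2.8, 4.1]
levels = [1.0, -1.0, 1.0, 1.0, -1.0]
print(output_at(5.0, edges, levels))
```

Because the edge times enter only as arguments of F, the spacing between edges can be arbitrary, which is the sense in which the accuracy does not depend on the time step.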
Time Manager
The time manager has two tasks: 1) it assigns a time to each emulation cycle, and 2) it generates clock edges for emulated blocks. To maximize the emulation throughput, its objective is to take the largest time step possible without skipping over any clock edges.
In our architecture, every emulated clock stores the time of its next clock edge, and the minimum of these times is selected for the next emulation cycle. Any clock whose edge is to occur at that time will output the edge and update the time of its next edge.
Our time manager design ensures that at least one clock edge is generated in every emulation cycle; there are no "analog-only" cycles. As a result, the emulation rate of the entire mixed-signal system is similar to that of the digital subsystem alone.
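The selection rule described above can be sketched in a few lines of Python. This is a behavioral model only, assuming integer time units (picoseconds here) and jitter-free clocks; the class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class EmulatedClock:
    name: str
    period: int      # nominal period, in integer time units (e.g. ps)
    next_edge: int   # time of the next edge, in the same units

def advance(clocks):
    """Run one emulation cycle and return (time, names of clocks that fired)."""
    # The emulation time jumps straight to the earliest pending edge
    now = min(c.next_edge for c in clocks)
    fired = []
    for c in clocks:
        if c.next_edge == now:
            fired.append(c.name)       # this clock outputs an edge
            c.next_edge += c.period    # and schedules its following edge
    return now, fired

# Example: two clocks with the same period and a 40 ps phase offset
demo = [EmulatedClock("tx", 125, 0), EmulatedClock("rx", 125, 40)]
for _ in range(4):
    print(advance(demo))
```

Since the minimum is always attained by at least one clock, every call produces at least one edge, which mirrors the absence of analog-only cycles.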
IMPLEMENTATION
We implemented our emulator architecture for an 8 GT/s high-speed link transceiver design. As shown in Figure 3, our emulator has adjustable transmitter (TX) and receiver (RX) equalization, with a clock and data recovery (CDR) loop that is closed through a bang-bang phase detector (BBPD) and digitally-controlled oscillator (DCO). A decision-feedback equalizer (DFE) helps to reduce intersymbol interference (ISI).
Precomputation of Analog Dynamics
The ADE implements the combined analog dynamics of the channel and the continuous-time linear equalizer (CTLE). Channel dynamics were computed from S-parameter measurements [9], while the CTLE transfer function was based on the PCIe specification, consisting of two fixed poles and an adjustable zero to support adaptive equalization.
In our implementation, the CTLE zero can be positioned in one of 16 different settings between 0.4 and 2.0 GHz, meaning that the CTLE is not time-invariant. To handle this, the ADE stores a family of precomputed step responses, each representing the combined dynamics of the channel and CTLE in one of its settings (Figure 4). During each emulation cycle, the ADE selects the appropriate step response based on the current CTLE setting.
The dynamics of transitions between CTLE settings are not captured by this approach. However, it is often unnecessary to model these transitions in detail, since they are typically much shorter than the amount of time that the adaptive equalization algorithm spends in a given CTLE setting.
Analog Dynamics Engine
As shown in Figure 5, the ADE is implemented as an array of taps, each containing a step response lookup table. Following Equation 3, the ADE output is a weighted sum of its input history, with each weight computed as the difference in step response value between two neighboring ADE taps.
Since the ADE has a finite number of taps, the implementation effectively truncates Equation 3. For our system, 85 taps were sufficient to limit the truncation error to a few tenths of a percent. As a guideline, the number of taps should be approximately the ratio of the step response settling time to the TX clock period, independent of the emulation time resolution.
Compressing Step Response Data
To reduce the memory footprint of the ADE, each of its taps is compressed by trimming its domain and by using piecewise-linear approximations.
Domain Trimming. In general, each ADE tap must evaluate the step response over a different timespan. For example, the first tap in our ADE reads out step response values near t = 0 ns while the last tap reads out values near t = 10 ns. More generally, the kth ADE tap will evaluate the step response at a time between (k − 1) and k periods of the TX clock.
Taking advantage of this property, we trim the step response lookup tables individually so that taps store only the data they may actually need. In our emulator, the TX clock has a nominal period $T_{TX}$ and a period jitter that is uniformly distributed between $-J_{TX}$ and $+J_{TX}$, so its period is guaranteed to lie between $T_{TX} - J_{TX}$ and $T_{TX} + J_{TX}$. Hence, the domain of the kth lookup table can be trimmed to the following interval:
$$\left[ (k-1)\,(T_{TX} - J_{TX}),\; k\,(T_{TX} + J_{TX}) \right]$$
If the jitter distribution were unbounded, domain trimming would require an approximate time range to be determined for each tap. For example, suppose that the TX clock periods were modeled as independent and identically distributed Gaussian random variables with mean $T_{TX}$ and standard deviation $\sigma_{TX}$. The sum of n periods would then be a Gaussian random variable with mean $nT_{TX}$ and standard deviation $\sigma_{TX}\sqrt{n}$. Hence, with high probability ($1 - 2 \cdot 10^{-9}$), the duration of n TX periods would lie within $nT_{TX} \pm 6\sigma_{TX}\sqrt{n}$. It would therefore likely be acceptable to trim the domain of the kth lookup table to the following interval:
$$\left[ (k-1)\,T_{TX} - 6\sigma_{TX}\sqrt{k-1},\; k\,T_{TX} + 6\sigma_{TX}\sqrt{k} \right]$$
In the exceptional case that an ADE tap must evaluate the step response outside of this range, it could do so by extrapolation or by reading from a neighboring tap.
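The two trimming rules can be written out as a short Python sketch. It assumes, as stated above, that the kth tap covers times between (k − 1) and k TX periods; the function names and the numeric values in the example (125 ps period for 8 GT/s, 5 ps bounded jitter, 1 ps Gaussian sigma) are hypothetical.

```python
import math

def trim_bounded(k, t_tx, j_tx):
    """Domain of tap k (k = 1, 2, ...) when period jitter is bounded by +/- j_tx."""
    return (k - 1) * (t_tx - j_tx), k * (t_tx + j_tx)

def trim_gaussian(k, t_tx, sigma_tx, n_sigma=6.0):
    """Approximate domain of tap k for i.i.d. Gaussian period jitter.

    The sum of n periods has standard deviation sigma_tx * sqrt(n), so each
    bound is widened by n_sigma standard deviations of the relevant sum.
    """
    lo = (k - 1) * t_tx - n_sigma * sigma_tx * math.sqrt(k - 1)
    hi = k * t_tx + n_sigma * sigma_tx * math.sqrt(k)
    return max(lo, 0.0), hi  # the step response is not needed before t = 0

# Two-sided probability of a Gaussian falling outside +/- 6 sigma
p_outside = math.erfc(6.0 / math.sqrt(2.0))

print(trim_bounded(85, 125, 5))       # times in ps
print(trim_gaussian(85, 125.0, 1.0))  # times in ps
print(p_outside)
```

The `p_outside` value is the quantity behind the "1 − 2 · 10⁻⁹" probability quoted in the text.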
Piecewise-Linear Approximation.
We use piecewise-linear (PWL) lookup tables to store step response data in a memory-efficient manner. As illustrated in Figure 6, step responses were approximated by a sequence of line segments whose offsets and slopes were stored in lookup tables. For each ADE tap, the number of PWL segments was determined by starting with two segments and iteratively doubling the number of segments until the error in the PWL representation was less than 0.1%. Within each iteration, linear programming was used to determine an optimal PWL representation.
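The doubling loop can be sketched as follows. Note that this sketch does not use linear programming: as a simpler stand-in, each segment just interpolates the step response at its endpoints, so it will generally need more segments than an optimized fit. The error is measured as an absolute value against a step response with unit final value, which is an assumption about how the 0.1% figure is defined. All names are hypothetical.

```python
import math

def pwl_error(F, t0, t1, n_seg, n_check=64):
    """Max abs error of an endpoint-interpolating PWL fit with n_seg uniform segments."""
    width = (t1 - t0) / n_seg
    worst = 0.0
    for s in range(n_seg):
        a = t0 + s * width
        fa, fb = F(a), F(a + width)
        slope = (fb - fa) / width          # stored alongside the offset fa
        for i in range(n_check + 1):       # sample the error inside the segment
            t = a + width * i / n_check
            worst = max(worst, abs(fa + slope * (t - a) - F(t)))
    return worst

def choose_segments(F, t0, t1, tol, max_seg=1 << 16):
    """Start with two segments and double until the PWL error is within tol."""
    n_seg = 2
    while pwl_error(F, t0, t1, n_seg) > tol:
        n_seg *= 2
        if n_seg > max_seg:
            raise ValueError("tolerance not reached within max_seg segments")
    return n_seg

# Example: a first-order step response over one trimmed domain, 0.1% tolerance
n = choose_segments(lambda t: 1.0 - math.exp(-t), 0.0, 1.0, 1e-3)
print(n)
```

Keeping the segment count a power of two makes the segment index a simple bit slice of the time value, which is one plausible reason for doubling rather than incrementing.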
While not done in this current implementation, a multivariate PWL representation could be used to further reduce the memory overhead of an adjustable model by interpolating between its settings instead of storing the step response of each one.
Time Manager
There are two emulated clocks in our system, one for the transmitter and one for the receiver. The RX clock has two output phases, since the BBPD uses both edges of the RX clock, while the TX clock has only one output phase to represent its rising edge.
Both clocks are implemented using instances of the module shown in Figure 7. During each emulation cycle, the clock module compares the emulation time, time_in, to the time of its next clock edge, time_out. If the two match, it asserts one of the cke_out signals and increments time_out. The increment includes jitter, which is implemented by scaling the output of a linear feedback shift register (LFSR).
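A simplified behavioral model of this clock module is sketched below in Python. It models a single output phase only, uses integer time units, and picks a 16-bit Fibonacci LFSR with a common maximal-length tap set; the LFSR width, taps, seed, and scaling are assumptions made for illustration, since the text does not specify them.

```python
class Lfsr16:
    """16-bit Fibonacci LFSR with taps 16, 14, 13, 11 (an illustrative choice)."""
    def __init__(self, seed=0xACE1):
        if seed & 0xFFFF == 0:
            raise ValueError("LFSR seed must be nonzero")
        self.state = seed & 0xFFFF

    def next(self):
        s = self.state
        bit = (s ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1
        self.state = (s >> 1) | (bit << 15)
        return self.state  # always in 1..65535

class ClockModule:
    def __init__(self, period, jitter, first_edge=0, seed=0xACE1):
        self.period = period        # nominal period (integer time units)
        self.jitter = jitter        # peak period jitter (same units)
        self.time_out = first_edge  # time of the next clock edge
        self.lfsr = Lfsr16(seed)

    def step(self, time_in):
        """Return True (cke_out asserted) if an edge occurs at time_in."""
        if time_in != self.time_out:
            return False
        # Scale the LFSR output to an offset within [-jitter, +jitter]
        r = self.lfsr.next()
        offset = (2 * r - 65536) * self.jitter // 65536
        self.time_out += self.period + offset
        return True

# Example: 125 ps period with +/- 2 ps jitter, times in femtoseconds
clk = ClockModule(period=125_000, jitter=2_000)
t = clk.time_out
for _ in range(3):
    clk.step(t)
    t = clk.time_out
    print(t)
```

An LFSR gives approximately uniform, bounded jitter, which is consistent with the bounded-jitter assumption used for domain trimming.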
The cke_out signals are used to generate clock signals for each clock phase by gating a free-running 30 MHz emulator clock on a cycle-by-cycle basis. The gating itself is performed by a Xilinx Mixed-Mode Clock Manager (MMCM) IP block, which ensures glitch-free gating and low inter-clock skew.
Figure 5: Implementation of the analog dynamics engine (ADE). Each pulse response is computed as the difference between two step response values, which depend on the timing of step changes in the analog input. The pulse responses are weighted by values from the input history and summed to form the ADE output. Our emulator uses 85 taps to store an input history spanning about 10 ns, with each tap corresponding to a unit interval (UI) of the link.
RESULTS
In this section, we discuss measurements of emulator throughput, resource utilization, and accuracy. The Xilinx ZC706 board, which features a Xilinx Zynq-7045 FPGA, was used in these evaluations. Compiling our high-speed link emulator for this platform took 11 min from synthesis to bitstream generation using Vivado 2016.4.
Figure 7: Clock module used to model TX and RX clock behavior. During each emulation cycle, time_in is compared to time_out. If they match, time_out is incremented and one of the cke_out signals is asserted. The clock phase is advanced by rotating mask, and jitter is implemented by adding the scaled output of an LFSR to the period input.
Emulation Throughput
Since the emulator clock runs at 30 MHz and three emulator clock cycles are required to process each unit interval (UI), the emulator throughput is $10^7$ UI/s, or 1.25 ms/s.
Performance Comparison.
The CPU simulation rate of a similar high-speed link written in SystemVerilog as a fast functional model for validation was 0.741-1.429 µs/s [7]. In this model, an event-driven approach was used, with analog dynamics represented as a sum of linear filters and waveforms represented by piecewise-linear segments. Our emulation system is at least 875x faster than this comparable, high-performance CPU simulation.
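The throughput and speedup figures can be cross-checked with a few lines of arithmetic. The 30 MHz emulator clock and 8 GT/s data rate come from the text; the value of three emulator cycles per UI is an assumption inferred from the reported 10^7 UI/s, and the variable names are hypothetical.

```python
emu_clk_hz = 30e6        # free-running emulator clock
cycles_per_ui = 3        # assumption: inferred from 30 MHz / 1e7 UI/s
data_rate_hz = 8e9       # 8 GT/s link, so one UI is 125 ps

ui_per_second = emu_clk_hz / cycles_per_ui        # emulated UIs per wall-clock second
emulated_s_per_s = ui_per_second / data_rate_hz   # emulated seconds per wall-clock second

# CPU simulation rate range reported for the comparable model in [7]
cpu_rate_min, cpu_rate_max = 0.741e-6, 1.429e-6
speedup_min = emulated_s_per_s / cpu_rate_max     # against the fastest CPU run
speedup_max = emulated_s_per_s / cpu_rate_min     # against the slowest CPU run

print(ui_per_second, emulated_s_per_s, speedup_min, speedup_max)
```

The "at least" speedup is taken against the fastest reported CPU rate, which is why the upper end of the CPU range sets the lower bound.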
A different fast CPU simulation approach based on s-domain modeling has also been reported [5]. In that case, a high-speed link was implemented using a 50-pole channel model, and a simulation rate of 0.110-0.129 µs/s was achieved. In comparison, our emulation throughput is at least 9,690x higher.
In order to isolate the speedup in running our emulator on an FPGA, we measured the performance of a CPU-based simulation of the emulator's SystemVerilog code. This simulation was run using Cadence Xcelium 18.03 on an Intel Xeon E5645 CPU (2.4 GHz) with 96 GB RAM; we measured a simulation rate of 0.192 µs/s. Hence, our architecture runs 6,510x faster on an FPGA than on a multi-core CPU-based simulator.
Resource Utilization
Table 2 shows the FPGA resource utilization of the entire emulation system, demonstrating that no more than 17.1% of any resource was needed. As a result, there would be ample room left over to emulate a larger digital subsystem. For example, one could build a more complete multi-lane SERDES system including equalization adaptation logic and PCS (Physical Coding Sublayer). Table 3 summarizes the resource utilization of the ADE alone. Note that only 39% of block RAM (BRAM) in the emulator is used by the ADE; the remaining BRAM tiles are consumed by Integrated Logic Analyzer (ILA) IP blocks to capture internal waveforms.
LUTs and FFs.
To put the LUT and FF utilization in perspective, one Xilinx MicroBlaze soft processor consumes 2,071 LUTs and 1,672 FFs with typical settings on our FPGA [12] . Hence, our emulator has a LUT and FF footprint equivalent to about five or six MicroBlaze cores. Approximately 100 more such cores could fit in the resources remaining after instantiating our emulator. 
BRAM.
Our individual optimization of the ADE taps reduced the BRAM tile requirement by more than 22.5x, from 2,097 (which would not have fit on our FPGA) to 93. As a result, the largest consumers of BRAM in our emulator are the ILA blocks, rather than the PWL tables in the ADE. Figure 8 shows the optimized lookup table size for each ADE tap. More memory is required to represent parts of the step response that are rapidly varying, such as the area around t = 4 ns in Figure 4 . However, the majority of tables (92 %) require no more than a half BRAM tile, which is the minimum unit that can be allocated on our FPGA.
DSP.
Most (97.9%) of the DSP utilization is attributed to the ADE. Each of its 85 taps has two multipliers: one to implement its PWL table, and one to weight its input value by the corresponding pulse response. Since each DSP slice in our FPGA contains a single multiplier, the expected number of DSP slices consumed by the ADE is therefore around 170. The actual DSP utilization is 18.3% lower, since some low-precision multiplications were synthesized to LUTs.
Emulation Accuracy
The accuracy of our emulator is evaluated in two ways. First, the waveforms of our FPGA emulation are directly compared with those of a CPU simulation. Second, we compare high-level behavioral metrics between the two approaches.
Relative to the CPU simulation, the emulation has several potential sources of error: 1) quantization error due to a fixed-point representation, 2) PWL approximation error in representing step responses, and 3) truncation error due to the finite number of pulses used. Figure 9 shows a sample comparison of the transient waveforms.
Behavioral Accuracy. Figure 10 shows a comparison of the CDR startup waveforms from our emulator and from a CPU simulation. They are in generally good agreement, with 10 % settling times of 513 ns and 524 ns for the emulation and simulation, respectively. Figure 11 compares amplitude histograms at the DFE output. The standard deviation about "0" and "1" levels was 39.2 mV for the emulation and 37.4 mV for the simulation. Note that the CPU simulations used in these two comparisons did not include clock jitter.
EXTENSIONS
In this section, we describe possible extensions to model nonlinearity and handle a broader class of analog input signals.
Handling a Broader Class of Inputs
Figure 11: Amplitude histogram at the DFE output, constructed from emulator data. The dashed curve represents the distribution of comparable data gathered in a CPU simulation.
Footnote 1: Relative error is defined as $\max |y_{FPGA} - y_{CPU}| / \max |y_{CPU}|$, where $y_{FPGA}$ and $y_{CPU}$ are the FPGA and CPU waveforms, respectively.
Up until this point, we have used the term digitally-driven to refer to analog input signals that change only on digital clock edges, meaning that they are piecewise-constant. However, it is possible to broaden this definition to include piecewise-polynomial waveforms, which can at least approximately represent any arbitrary analog signal.
Suppose that an analog signal is comprised of polynomial segments of degree n:
Assuming that this signal is supplied as input to a system with an impulse response f (t), the resulting output will be:
Applying the Binomial Theorem yields:
The n step response-like functions $F_k$ could be precomputed for use during emulation and, as before, these functions could be implemented using PWL tables.
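One way the derivation might be written out is sketched below. It assumes that each segment is expressed in powers of $(t - t_k)$ with hypothetical coefficients $a_{k,j}$ and is active between the edges $t_k$ and $t_{k+1}$; the indexing of the resulting functions is likewise an assumption of this sketch.

```latex
\begin{align*}
x(t) &= \sum_k \sum_{j=0}^{n} a_{k,j}\,(t - t_k)^j
        \left[ u(t - t_k) - u(t - t_{k+1}) \right] \\
y(t) &= \sum_k \sum_{j=0}^{n} a_{k,j}
        \int_{t - t_{k+1}}^{t - t_k} f(\tau)\,(t - t_k - \tau)^j \, d\tau \\
     &= \sum_k \sum_{j=0}^{n} a_{k,j} \sum_{i=0}^{j} \binom{j}{i} (-1)^i\,
        (t - t_k)^{j-i} \left[ F_i(t - t_k) - F_i(t - t_{k+1}) \right] \\
F_i(t) &= \int_{-\infty}^{t} \tau^i f(\tau)\, d\tau
\end{align*}
```

In this form $F_0$ is the ordinary step response, $F_1, \dots, F_n$ are the additional step response-like functions, and setting n = 0 recovers the summation of pulse responses used for piecewise-constant inputs.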
Modeling Nonlinearity
Memoryless nonlinearities occurring at the input or output of an analog block are straightforward to model in our architecture, since they can be implemented by lookup tables outside of the ADE. For example, in our high-speed link implementation, the transfer function from the DCO code n to the RX clock period $T_{RX}$ is given by the nonlinear relation $T_{RX} = 1/(\alpha + \beta n)$. We used a small (160-bit) PWL lookup table to implement this behavior.
Our digitally-driven approach could also be used to model a block governed by the first-order nonlinear dynamics $\dot{y} = g(x, y)$, where x is its piecewise-constant input and y is its output. Assuming that x is constant from $t_1$ to $t_2$, the value of y at the end of that time interval can be written as a function of the interval length and the values of x and y at the beginning of the interval:
$$y(t_2) = G\left(t_2 - t_1,\; x(t_1),\; y(t_1)\right)$$
In principle, the function G could be precomputed for use during emulation. While this approach might not always be practical, there are at least two cases that lend themselves to efficient implementation.
First, if the system's differential equation can be solved analytically for a constant input, then G may have a convenient closed-form expression. For example, an integrator that saturates to ±1 could be represented by:
$$y(t_2) = \min\left(\max\left(y(t_1) + (t_2 - t_1) \cdot x(t_1),\, -1\right),\, 1\right)$$
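The saturating integrator expression translates directly into code. The function name and the configurable limits are illustrative additions; the defaults reproduce the ±1 saturation in the expression above.

```python
def saturating_integrator(y1, x1, dt, lo=-1.0, hi=1.0):
    """Advance a saturating integrator by dt with the input held constant at x1."""
    return min(max(y1 + dt * x1, lo), hi)

# Example: integrate a constant input of 1 across two consecutive clock edges
y = 0.0
y = saturating_integrator(y, 1.0, 0.75)
y = saturating_integrator(y, 1.0, 0.75)
print(y)
```

With a constant input the output moves monotonically, so clamping once at the end of the interval gives the same result as clamping continuously, which is why a single closed-form update per clock edge suffices.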
Second, if the input is restricted to certain discrete values, as in a high-speed link, then it may be possible to represent G using a small number of precomputed trajectories.
As an example, consider a filter governed by the nonlinear dynamics $\dot{y} = (x - y)/\tau(y)$, with τ positive. Assuming that the input is limited to values of −1 and 1, there are effectively two unique output trajectories: one starting at 1 and decreasing towards −1 (with input −1), and the other starting at −1 and increasing towards 1 (with input 1). These two trajectories could be precomputed for use during emulation, so that whenever the input value changes, the trajectory corresponding to the new input value could be played back starting from the output value at the time of the input transition.
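The trajectory playback idea can be sketched in Python as follows. The particular `tau(y)`, the RK4 integrator, the uniform grid spacing `H`, and the linear interpolation are all assumptions made for illustration; a hardware implementation would more likely store the trajectories in PWL tables.

```python
import bisect

H = 1e-3  # grid spacing of the stored trajectories (hypothetical)

def tau(y):
    """A hypothetical state-dependent, strictly positive time constant."""
    return 1.0 + 0.5 * y * y

def precompute(x, y_start, t_end=20.0, h=H):
    """Integrate dy/dt = (x - y) / tau(y) from y_start with RK4 on a uniform grid."""
    f = lambda y: (x - y) / tau(y)
    ys = [y_start]
    for _ in range(int(round(t_end / h))):
        y = ys[-1]
        k1 = f(y)
        k2 = f(y + 0.5 * h * k1)
        k3 = f(y + 0.5 * h * k2)
        k4 = f(y + h * k3)
        ys.append(y + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6)
    return ys

RISING = precompute(+1.0, -1.0)        # input +1, output climbs from -1 towards +1
FALLING = precompute(-1.0, +1.0)       # input -1, output falls from +1 towards -1
FALLING_KEYS = [-y for y in FALLING]   # negated so the search keys are ascending

def playback(y0, x, dt):
    """Output after holding input x (+1 or -1) for dt, starting from output y0."""
    ys, keys, sign = (RISING, RISING, 1.0) if x > 0 else (FALLING, FALLING_KEYS, -1.0)
    # Find where the stored trajectory passes through y0 (inverse lookup)
    i = bisect.bisect_right(keys, sign * y0) - 1
    i = min(max(i, 0), len(ys) - 2)
    frac = (sign * y0 - keys[i]) / (keys[i + 1] - keys[i])
    # Move dt further along the trajectory and interpolate the stored samples
    s = (i + frac) * H + dt
    j = min(int(s / H), len(ys) - 2)
    g = min(s / H - j, 1.0)
    return ys[j] + g * (ys[j + 1] - ys[j])

# Example: a short input pattern applied at irregular intervals
y = -1.0
for x, dt in [(1.0, 0.7), (-1.0, 0.4), (1.0, 1.3)]:
    y = playback(y, x, dt)
    print(y)
```

The approach relies on the system being autonomous while the input is constant: the response from any starting value is a time-shifted portion of the single stored trajectory for that input.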
CONCLUSION
In this paper, we described an FPGA architecture for emulating a mixed-signal system with a digitally-driven analog block; that is, one whose input changes only on digital clock edges. The analog output of such a block is computed as a weighted sum of pulse responses, and is calculated in a way that allows the emulator to progress directly from one digital clock edge to the next. Unlike a conventional oversampled approach, the emulator's accuracy is independent of time step size.
Using an 8 GT/s high-speed link transceiver as an example, we implemented the proposed architecture on a Xilinx Zynq-7045 FPGA. The emulation rate achieved was 1.250 ms/s, which represents an 875x improvement over a high-performance CPU simulation of a similar system. The worst-case error observed in comparison to an idealized CPU computation of the analog dynamics was -0.7/+1.1%.
We conclude that the proposed architecture is appropriate for verifying the behavior of mixed-signal systems over long time scales, where CPU simulation would be impractically time-consuming. Owing to its low resource utilization and ability to handle multiple clock domains, we expect that it will scale well to large designs.
ACKNOWLEDGMENT
This work is supported by National Science Foundation Grant No. 1509126, a Hertz Foundation Fellowship, and a Stanford Graduate Fellowship.
