Abstract-This paper presents a review of the research status for real-time simulation of electric machines. The machine models considered are the lumped parameter models, including the phase-domain, d-q, and voltage-behind-reactance models, as well as the physics-based models, including the finite-element method and magnetic equivalent circuit models. These models are initially presented along with their relative advantages and disadvantages with respect to modeling fidelity and their computational intensity. A field-programmable gate array, a graphics processing unit, a chip multiprocessor, and computer clusters are the main hardware platforms for real-time simulations. An overview of such hardware platforms is presented and their comparative performances are evaluated with respect to real-time simulation of electric machines and drives on the basis of simulation acceleration, machine types, and modeling methodology.
I. INTRODUCTION

MODELING is the mathematical representation of a physical phenomenon. Simulation is the numerical representation of such models on a computing machine. Accurate and fast modeling and analysis environments are required for design optimization, dynamic characterization, controller design, and transient stability assessment of electric machines. Electric machines account for about 65% of the total energy consumption in industrial sectors [1]. The use of electric machines is likely to increase in the coming decades given the expected penetration of plug-in hybrid electric vehicles and distributed energy resources (e.g., wind farms) [2]-[5]. To facilitate an industry-wide transformation to highly efficient electric machines and optimal design of novel machine structures, accurate models and high-performance simulation tools are indispensable [6]. Investigation of such tools will enable: 1) a significant reduction in design cycle time for more efficient electric machines, thus shortening the time to market; 2) a more realistic representation of machine-drive systems, thus enabling smarter control of energy flow in 65% of the industrial loads; and 3) cosimulation of field equations and grid dynamics to accurately study the effects of machine-to-grid integration (as in wind farms). Machine models can be broadly categorized into lumped-parameter and physics-based models. Lumped parameter models, such as phase-domain (PD), direct-quadrature (d-q), and voltage-behind-reactance (VBR) models, have been heavily studied over the last half-century [7]. Such models are simple and fast to simulate, but they are often based on unreliable approximations. For example, while d-q models are suitable for the design of electric drives, they neglect spatial and nonlinear effects and assume sinusoidally distributed windings.
Overall, such models are suitable for studying small changes to established designs, but they fall short when considering radical designs or a more accurate representation. The physics-based models, on the other hand, are based on established physical principles and lead to higher simulation fidelity [8]. Finite-element method (FEM) models and magnetic equivalent circuit (MEC) models are the most popular physics-based models [9]. The finer modeling and simulation resolution they offer enables the study of detailed phenomena such as various internal faults or localized magnetic saturation in machines [7]. FEM models are computationally very intensive, and the model size and time step constraints make them unsuitable for use in a real-time simulation environment [10], [11]. Hence, they are mostly used for the final design verification. MEC models, in general, are seen as a compromise between FEM and lumped parameter models [12]. MEC models take machine geometry and material characteristics into account and are thus more refined than lumped parameter models [13]-[15]. The circuit-oriented approach makes them a more intuitive design tool compared with FEM models [10], [16].
For the last half-century, offline simulation was the major tool in design verification and testing before hardware prototyping and field deployment [19], [20]. Once the machine models grew in complexity, even for a moderately sized machine, the software overhead associated with offline simulation made it a less attractive option. Alternatively, in a real-time simulation environment, the model waveforms are reproduced within the same time interval as they would have been in actual physical time. A real-time emulated machine model, therefore, allows engineers to evaluate the controller, machine, or drive performance under a wide range of contingencies and extreme conditions in a nondestructive environment before field deployment [21].
2332-7782 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Parallel hardware platforms can be used to accelerate the simulation, thereby achieving real-time behavior of high-fidelity machine models. Hardware-assisted real-time simulation of electric machines includes the digital implementation of the machine model on a hardware platform (Fig. 1). The machine model is characterized by a set of ordinary differential equations (ODEs). The modeling complexity and the resulting computational intensity, i.e., the order of the ODEs, vary from model to model. These ODEs are solved using classical numerical integration techniques such as Runge-Kutta and Euler methods implemented on the hardware platforms. The hardware platforms consist of multiple computing units across which the computational load is distributed, with each unit processing independent sets of data. A field-programmable gate array (FPGA), a graphics processing unit (GPU), and CPU-based simulators, i.e., chip multiprocessors (CMPs) and computer clusters, are the hardware platforms currently used for real-time simulation of electric machines [22]-[24]. For the most part, the machine models used are the simplified and crudely approximated lumped parameter models. Traditionally, CMPs and computer clusters have been used for real-time simulation of d-q models [24]-[27]. MEC models have also been simulated in real time using CMPs [16]. However, both computer clusters and CMPs might suffer from high I/O latency and hence from difficulty in achieving nanosecond-scale time steps. This drawback makes them less suitable for simulating high-frequency phenomena such as PWM electric drives. On the other hand, with advancements in very large scale integration technology, FPGAs are gaining popularity in digital hardware realization of electric systems [10], [28]-[34].
FPGAs have very low I/O latency, and their nanosecond time step makes them ideal for the real-time simulation of high-frequency excitation signals. However, the limited hardware resources of FPGAs challenge their capability in simulating very large-scale power systems [33] that include multitudes of electric machines. While large models can be broken down into smaller ones, the relatively high cost of FPGAs and the potential intermodule latencies discourage the use of multiple FPGAs in a simulation environment. GPUs are an emerging alternative whose massively parallel architecture makes them suitable for large-scale numerical field analysis of electric machines [35]. However, CMPs provide faster results than GPUs for relatively smaller model orders [36], [37].
This paper presents a perspective on the choice of suitable hardware-assisted simulator platforms to mimic real-time behavior of high-fidelity machine models. The key to achieving real-time behavior lies in completing all data transfer and processing within the simulation time step. This is achieved by distributing the computational load across multiple processing units, with each unit processing independent sets of data. In general, lumped parameter models have been implemented on both FPGAs and CPU-based simulators. Since the lumped parameter models are computationally economical, GPUs are not considered to be the first-choice simulator due to their high setup and communication times. For higher fidelity models (e.g., physics-based models), GPUs and FPGAs have been primarily used to provide the necessary computational power thanks to their high number of computing units. In such cases, the large number of computations typically involved is a burden on simulation time for CPU-based simulators. The rest of this paper is organized as follows. Section II discusses the existing machine models used in a real-time simulation environment. Section III presents the hardware-assisted simulation platforms used for real-time simulation of electric machines. Section IV presents a comprehensive comparison of these platforms based on modeling methodology, machine types, and simulator types. Finally, Section V provides the concluding remarks.
II. ELECTRIC MACHINE MODELS
Electric machine models can be broadly classified into lumped parameter and physics-based models [9]. The model choice depends on the study objective and the fidelity required. Lumped parameter models are based on the coupled electric circuit approach, which simplifies the system description to predict machine behavior [38]. Overall, lumped parameter models are suitable for predicting the dynamic responses of electric machines when integrated in an environment with slow dynamics (e.g., interfaced with the power grid). Physics-based models, on the other hand, are based on established physical principles [8]. They have higher modeling fidelity, which naturally leads to a more fundamentally reliable simulation. They consider geometrical complexities, eddy currents, saturation, and hysteresis effects. From the digital simulation perspective, lumped parameter models can be simulated faster but might lack simulation fidelity, while the physics-based models are more reliable but can be computationally intensive.
A. Lumped Parameter Models
Lumped parameter models can be generally categorized into the PD, the d-q, and the VBR models [7] . In the PD models, the stator and rotor circuits are directly interconnected with the external drive network. Despite simplicity, PD models can be slow because of rotor position dependency of model parameters. Applying the Park transformation to the PD models yields the d-q models with constant inductance matrices, which simplifies numerical execution of machine equations considerably. Since machine variables are represented in d-q coordinates, they must be converted back into abc coordinates when interfacing with drives or external power systems. VBR models achieve a direct interface of the stator circuit with an external circuit, while retaining simplicity of the d-q approach. These models are computationally affordable with good numerical stability and accuracy even when using large time steps. The model description used in each methodology is given as follows.
1) PD Model:
In PD models, the machine variables are expressed in abc coordinates. The voltage equations of a symmetrical three-phase induction machine can be expressed as follows:
The flux linkages are expressed by
where
V, λ, and I represent the voltage, flux linkage, and current, respectively. The subscripts (abc)s and (abc)r represent the three-phase stator and rotor quantities, respectively. r_s and r_r are the stator and rotor diagonal resistance matrices, respectively. L_s and L_r are the stator and rotor inductance matrices, respectively. L_sr is the mutual inductance matrix. θ_r is the angle between the as and ar axes of the stator and the rotor.
The PD models have the stator and rotor circuits directly connected with the external drive network, resulting in the simultaneous solution of a machine's physical variables and the network equations. The dependence of the matrix L(θ_r) on the rotor position can increase model complexity and computational cost [7].
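For reference, the PD voltage and flux-linkage relations can be sketched in the conventional coupled-circuit form, using the symbols defined above. This is an assumed standard formulation for a symmetrical three-phase induction machine, written without equation numbers because the numbering is not asserted here.

```latex
\begin{align*}
\mathbf{V}_{(abc)s} &= \mathbf{r}_s \mathbf{I}_{(abc)s} + \frac{d\boldsymbol{\lambda}_{(abc)s}}{dt}, \\
\mathbf{V}_{(abc)r} &= \mathbf{r}_r \mathbf{I}_{(abc)r} + \frac{d\boldsymbol{\lambda}_{(abc)r}}{dt}, \\
\begin{bmatrix} \boldsymbol{\lambda}_{(abc)s} \\ \boldsymbol{\lambda}_{(abc)r} \end{bmatrix}
&=
\begin{bmatrix}
\mathbf{L}_s & \mathbf{L}_{sr}(\theta_r) \\
\mathbf{L}_{sr}^{T}(\theta_r) & \mathbf{L}_r
\end{bmatrix}
\begin{bmatrix} \mathbf{I}_{(abc)s} \\ \mathbf{I}_{(abc)r} \end{bmatrix}.
\end{align*}
```

The rotor-position dependence enters only through the mutual block L_sr(θ_r), which is why the inductance matrix has to be re-evaluated as the rotor turns.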
2) d-q Model:
The d-q model is derived from the PD model using the Park transformation. The general transformation for a three-phase induction machine is
In (4), f can be voltage, flux linkages, or currents. θ is the reference frame angle. The subscript qd0 refers to the transformed variables. The voltage equations in the d-q model, with all the rotor variables referred to the stator, can be written as [38] 
where ω = dθ/dt is the reference-frame speed and ω_r = dθ_r/dt is the rotor's angular speed. The flux linkages are given by
The main advantage of d-q models is the constant inductance matrices, leading to faster simulation execution. However, interfacing the machine model with the external network becomes complicated, leading to less reliable numerical results when large simulation time steps are used [7], [39].
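The effect of the transformation can be illustrated with a minimal Python sketch. It assumes the commonly used arbitrary-reference-frame convention with a 2/3 scaling factor; the function names `park_transform` and `balanced_abc` and the test amplitude are hypothetical and chosen only for illustration.

```python
import math

SHIFT = 2.0 * math.pi / 3.0  # 120 degrees between phases


def park_transform(f_a, f_b, f_c, theta):
    """Map abc quantities to qd0 quantities (2/3-scaled convention assumed)."""
    f_q = (2.0 / 3.0) * (f_a * math.cos(theta)
                         + f_b * math.cos(theta - SHIFT)
                         + f_c * math.cos(theta + SHIFT))
    f_d = (2.0 / 3.0) * (f_a * math.sin(theta)
                         + f_b * math.sin(theta - SHIFT)
                         + f_c * math.sin(theta + SHIFT))
    f_0 = (f_a + f_b + f_c) / 3.0
    return f_q, f_d, f_0


def balanced_abc(amplitude, angle):
    """Balanced three-phase set with the given amplitude and electrical angle."""
    return (amplitude * math.cos(angle),
            amplitude * math.cos(angle - SHIFT),
            amplitude * math.cos(angle + SHIFT))


# With the reference frame locked to the electrical angle, the sinusoidal
# abc quantities map to constants in qd0 coordinates.
samples = [park_transform(*balanced_abc(10.0, ang), ang) for ang in (0.0, 0.7, 2.5)]
```

In this synchronously rotating frame every sample has the same q component and zero d and 0 components, which is the property that removes the time variation from the transformed machine equations.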
3) VBR Model: In this model, the stator variables are expressed in abc coordinates, while the rotor variables are expressed in d-q coordinates. Several VBR models have been derived depending on the application [39] - [42] . Here, the VBR-III model of an induction machine, where resistance and inductance matrices are diagonal, is discussed assuming ungrounded stator windings. The stator voltage equation is
The so-called subtransient voltages e_(abc)s are
The rotor voltage equations are expressed as
Since the stator circuit parameters (RL) are constant and decoupled from each other, the stator of a VBR model can interface directly with the external network, which makes simultaneous solution of a machine's physical variables and the network equations possible. The numerical stability and accuracy of a VBR model is ensured even at relatively large time steps [7] , [40] . Also, the VBR model gives more accurate results than the PD model due to its rescaled and improved eigenvalues [39] , [43] .
B. Physics-Based Models
Physics-based models include FEM and MEC models [8] , [44] - [46] . MEC models are more intuitive than FEM models while remaining based on physics [44] , [47] . MEC models are easily parameterized and allow for a fast parameter sweep, which is ideal for iterative design optimizations [48] , [49] . The extension to 3-D cases is straightforward as the MEC model uses tube elements rather than the point elements used in FEM models. It is shown in [9] that MEC is a more promising design and analysis tool, especially when the machine is pushed into saturation.
1) FEM Model:
Here, the machine geometry is approximated by finite elements, e.g., triangles for 2-D FEM and tetrahedrons for 3-D FEM [50] . Maxwell's equations are solved for each element
with the material equations
where H is the magnetic field intensity, B is the magnetic flux density, E is the electric field intensity, J is the current density, υ is the magnetic reluctivity, and σ is the electrical conductivity. Here, the displacement current component (∂D/∂t) is neglected due to the low frequencies encountered in electric machines. Using the magnetic vector potential A, the magnetic flux density B and the electric field intensity E can be expressed as
where V is the electric potential [51] . For an induction machine with solid conductors, (15)- (17) give the following governing equation for a 2-D electromagnetic field:
The numerical field equation is obtained by multiplying (18) with the shape functions and integrating the product over the whole finite-element mesh. This results in a large set of simultaneous equations containing unknown node potentials.
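Under the quasi-static assumption stated above (displacement current neglected), the field relations described in this subsection can be sketched in the following standard form. The symbols are those defined above; the divergence condition is included on the assumption that it belongs to the set, and no equation numbers are asserted.

```latex
\begin{align*}
\nabla \times \mathbf{H} &= \mathbf{J}, &
\nabla \times \mathbf{E} &= -\frac{\partial \mathbf{B}}{\partial t}, &
\nabla \cdot \mathbf{B} &= 0, \\
\mathbf{H} &= \upsilon \mathbf{B}, &
\mathbf{J} &= \sigma \mathbf{E}, \\
\mathbf{B} &= \nabla \times \mathbf{A}, &
\mathbf{E} &= -\frac{\partial \mathbf{A}}{\partial t} - \nabla V, \\
\nabla \times \left( \upsilon \nabla \times \mathbf{A} \right)
  + \sigma \frac{\partial \mathbf{A}}{\partial t} + \sigma \nabla V &= 0.
\end{align*}
```

The last line follows by substituting the material and potential relations into the first curl equation. In a 2-D model, A and the current density have only an axial component, so the unknowns reduce to one scalar potential per mesh node.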
As an example of the FEM model of electric machines, the 2-D model of a 3-phase, 45-kW, 4-pole squirrel-cage induction machine is simulated using motor analysis software [52] and illustrated in Fig. 2. The eddy current is automatically included in a dynamic FEM model. However, FEM models are not intuitive and do not facilitate design [9], [23]. Including system nonlinearities (e.g., saturation) and geometrical complexities (e.g., laminations and 3-D effects) substantially complicates model formulation. In particular, the element count in 3-D FEM increases with the cube of the linear dimensions. Optimization routines of magnetic components can require on the order of 10^6 solutions. Although significant work has been done on meshing and sparse matrix algebra [53]-[56], FEM models are still computationally prohibitive and time consuming to use in an iterative design optimization framework or for cosimulation of machine-grid dynamics [11], [16], [23].
As an example, we ran a 2-D FEM simulation of an 8/6 switched reluctance machine (SRM) using the commercial software MagNet 7.3 [57] on a PC with an Intel Core i7 CPU at 3.4 GHz. For a 200-Hz excitation of the SRM, the resulting simulation time was found to be about 50 000× longer than real time. Given their high fidelity, FEM models are mainly used in the offline simulation of electric machines [10], [16], [58] to validate real-time simulation results.
Because of the computational burden mentioned above, FEM models have been used only indirectly in the real-time simulation of electric machines [11], [59]-[68]. In these studies, the inductance matrices are precomputed as a function of rotor position and stator currents using commercial software packages such as Maxwell or JMAG [69], [70]. These inductances are then tabulated in large lookup tables and used directly in real-time simulation, thereby avoiding the time-consuming large-scale matrix calculation in real time. For example, in [11], [64], and [65], finite-element analysis of a permanent magnet synchronous motor (PMSM) precomputes the cogging torque, back EMF, flux linkage, and the inductance matrix offline as functions of rotor position and stator currents. These finite-element solutions were stored in lookup tables in the Simulink environment and then used directly during the real-time simulation of the enhanced PD model.
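The lookup-table idea can be sketched in a few lines of Python. The example below assumes a single inductance tabulated over rotor position and stator current and uses bilinear interpolation with clamping at the table edges; the grid, the table values, and the function names are hypothetical placeholders, not data from the cited studies.

```python
from bisect import bisect_right

# Hypothetical offline FEM results: inductance in mH, tabulated over
# rotor position (degrees, rows) and stator current (A, columns).
THETA_GRID = [0.0, 15.0, 30.0, 45.0]
CURRENT_GRID = [0.0, 5.0, 10.0]
INDUCTANCE_TABLE = [
    [10.0, 9.0, 7.0],
    [14.0, 12.0, 9.0],
    [20.0, 16.0, 11.0],
    [24.0, 19.0, 12.0],
]


def _bracket(grid, x):
    """Return the lower grid index and the interpolation weight for x (clamped)."""
    x = min(max(x, grid[0]), grid[-1])
    i = min(bisect_right(grid, x) - 1, len(grid) - 2)
    weight = (x - grid[i]) / (grid[i + 1] - grid[i])
    return i, weight


def lookup_bilinear(theta_grid, current_grid, table, theta, current):
    """Bilinear interpolation of a 2-D table at (theta, current)."""
    i, w_t = _bracket(theta_grid, theta)
    j, w_c = _bracket(current_grid, current)
    low = table[i][j] * (1.0 - w_c) + table[i][j + 1] * w_c
    high = table[i + 1][j] * (1.0 - w_c) + table[i + 1][j + 1] * w_c
    return low * (1.0 - w_t) + high * w_t


inductance_mid = lookup_bilinear(THETA_GRID, CURRENT_GRID, INDUCTANCE_TABLE, 7.5, 2.5)
```

Each lookup costs a fixed, small number of operations regardless of the mesh size used offline, which is what makes this approach compatible with a fixed real-time step.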
2) MEC Model: In the MEC model, the machine's spatial structure is divided into small elements, the so-called permeance elements, to form a reluctance network (Fig. 3). The smaller the elements, the more accurate the resulting model will be. Each element is usually represented by a reluctance term ℛ, based on the material data and physical geometry of the element, as
\[ \mathcal{R} = \frac{l}{\mu A} \]
where μ is the permeability of the magnetic material, and l and A are the length and cross-sectional area of the flux path, respectively. The element can potentially include a voltage source to represent the magnetomotive force F and/or a current source to represent the permanent magnet flux φ_PM. The MEC model is solved to find the magnetic scalar potentials u at the centers of the permeance elements. From there, one can find the flux through each permeance element
The flux φ and the flux density B can further be processed to find the core loss from the material datasheet. The Maxwell stress tensor (MST) method is widely accepted for force calculation in MEC [71], [72]. In this method, an integration contour, surface S in Fig. 4, is considered on an object, e.g., a rotor tooth. The force F_m can be expressed as
where S_e is the integration surface in the reluctance element e, n_e is the normal unit vector of S_e, and T_MST,e is the MST. For rotational motions, the mechanical torque can be calculated through the cross product of the displacement vector and the mechanical force. Over the last decade, MEC models have been growing as an alternative to lumped parameter and FEM models [10], [14], [16], [71], [73], [74], as they can provide acceptable accuracy with reasonable computational effort. To achieve this, a full-order MEC model is typically order reduced. For example, we simulated an MEC model of a two-winding transformer on a Mac Pro computer with an Intel Xeon processor at 2.93 GHz and 64 GB of RAM using the ode23tb solver in MATLAB/Simulink. The simulation of the full-order model, with 300 state variables, took 1200 s of CPU time. The simulation of the reduced-order model, with only four state variables, achieved a speed gain of 500×. However, crude geometrical simplifications often degrade the accuracy of conventional MEC models. Moreover, MEC models normally omit a precise consideration of eddy currents. Static MEC models are valid for very low-frequency excitations and are not suitable for use with PWM inverter drives [75], [76]. Although more intuitive than FEM models, MEC models are not fully automated yet and, to some extent, depend on expert knowledge (e.g., a priori knowledge of major flux paths).
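The reluctance-network idea can be illustrated with a minimal Python sketch of the simplest possible case: a single series flux path made of an iron core and an air gap, excited by a coil. The geometry, the relative permeability, and the function names are hypothetical; a machine-level MEC would contain many such elements and would be solved as a network rather than as one series path.

```python
import math

MU_0 = 4.0e-7 * math.pi  # permeability of free space, H/m


def reluctance(length, area, mu_r=1.0):
    """Reluctance of a flux tube: length / (permeability * cross-sectional area)."""
    return length / (MU_0 * mu_r * area)


def series_flux(mmf, reluctances):
    """Flux through elements in series driven by one magnetomotive force."""
    return mmf / sum(reluctances)


# Assumed example: 0.4 m core path (linear, mu_r = 2000), 1 mm air gap,
# 1e-3 m^2 cross section, 100-turn coil carrying 2 A.
AREA = 1.0e-3
r_core = reluctance(0.4, AREA, mu_r=2000.0)
r_gap = reluctance(1.0e-3, AREA)
mmf = 100 * 2.0

flux = series_flux(mmf, [r_core, r_gap])
flux_density = flux / AREA          # tesla
gap_potential_drop = flux * r_gap   # ampere-turns dropped across the air gap
```

Even with a 1-mm gap, the air gap carries most of the magnetic potential drop in this sketch, which is why air-gap elements usually receive the finest discretization in an MEC model.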
III. HARDWARE PLATFORMS FOR REAL-TIME SIMULATION
A. Perspective on Real-Time Simulation
Real-time simulation is a discrete-time simulation in which the system state is assumed to change only at fixed time instants, synchronized with the real-time clock. The implementation of real-time simulation on hardware platforms involves three different tasks, as shown in Fig. 5: receiving or sending signals through I/O pins; executing computations, such as solving the ODEs characterizing a machine model, in parallel processing units; and transferring data between the processing units or from the processing units to the main memory. To mimic real-time behavior, all these tasks must be completed within the simulation time step. In real-time simulations, explicit numerical integration methods such as the fourth-order Runge-Kutta (RK4) and forward Euler methods have typically been used to solve the resulting ODEs [77]-[80]. Since the machine model equations are typically nonlinear, implicit methods are difficult to use because they require an iterative solution at every time step. Interested readers can find details of these numerical solvers in classical resources on scientific computing and numerical analysis [81], [82]. Numerical precision, computational time, and resources are the key parameters considered when choosing the most appropriate integration method. For example, the higher order RK4 integration algorithm results in better numerical precision but requires greater computational time and resources. The Euler method, on the other hand, is a first-order integration algorithm and, hence, requires less computational time, but it has lower numerical precision, particularly if the integration step size is not sufficiently small [79], [80], [83]. Based on the simulation setup, real-time simulation can be classified into two categories: hardware-in-the-loop (HIL) and software-in-the-loop (SIL) simulation [84]. In HIL simulation, the device under test, such as an electric machine, is physical while the remainder of the system is simulated on a hardware platform.
HIL simulation is particularly useful when it is difficult to get access to the real system or when it is too risky or extreme to test directly on the actual system. Therefore, this approach has the advantage of substantially accelerating design and testing validation and reducing cost while preventing possible damage to the actual system [85], [86]. In SIL, on the other hand, the entire system is simulated in real time on the same hardware platform. SIL has the advantage over HIL in that no physical inputs and outputs are used, thereby preserving signal integrity and allowing a larger number of computations to be performed quickly [87], [88]. However, the simulation is not always accurate given the simplifications in the modeling/abstraction phase [89].
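The trade-off between the forward Euler and RK4 methods discussed in this subsection can be illustrated with a short Python sketch. It integrates a series RL circuit, used here only as a stand-in for a machine model because its exact solution is known; the circuit values, step size, and function names are assumptions chosen for illustration.

```python
import math


def euler_step(f, t, x, h):
    """Forward Euler: one derivative evaluation per step."""
    return x + h * f(t, x)


def rk4_step(f, t, x, h):
    """Classical fourth-order Runge-Kutta: four derivative evaluations per step."""
    k1 = f(t, x)
    k2 = f(t + h / 2.0, x + h * k1 / 2.0)
    k3 = f(t + h / 2.0, x + h * k2 / 2.0)
    k4 = f(t + h, x + h * k3)
    return x + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0


def integrate(step, f, x0, h, n_steps):
    """Advance the state with a fixed step size, as a real-time solver would."""
    t, x = 0.0, x0
    for _ in range(n_steps):
        x = step(f, t, x, h)
        t += h
    return x


# Assumed test circuit: 1-ohm resistance, 10-mH inductance, 1-V step input.
R, L, V = 1.0, 0.01, 1.0


def di_dt(t, i):
    return (V - R * i) / L


def exact_current(t):
    return (V / R) * (1.0 - math.exp(-R * t / L))


H, N_STEPS = 1.0e-3, 20  # step size is one tenth of the L/R time constant
t_end = H * N_STEPS
err_euler = abs(integrate(euler_step, di_dt, 0.0, H, N_STEPS) - exact_current(t_end))
err_rk4 = abs(integrate(rk4_step, di_dt, 0.0, H, N_STEPS) - exact_current(t_end))
```

With the same step size, RK4 is several orders of magnitude more accurate in this sketch but needs four derivative evaluations per step instead of one, which is the cost that has to fit inside the real-time step.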
One main parameter in real-time simulation is the simulation time step size. The computational capability of the hardware platform and the efficiency of the software code determine the smallest achievable time step. To ensure simulation fidelity, the time step must be sufficiently short [84]. However, making it shorter than the time needed to complete all the tasks of a step results in overrun, i.e., loss of real-time behavior. Therefore, selecting the appropriate time step size is a compromise between simulation fidelity, including the frequency bandwidth that can be represented, and real-time feasibility. This issue becomes particularly challenging when simulating high-frequency drives that typically require a time step in the range of nanoseconds.
For the last half-century, simulation of electric machines was done using sequential programs on single-core CPUs [19], [35], [37]. These simulations suffered from long execution times. Parallel processing can be used to accelerate the simulation on hardware platforms such as the FPGA, the GPU, the CMP, and computer clusters. In parallel processing, the computational load is distributed across multiple processing units, with each unit processing independent sets of data. The processing of each data set is sequence independent. Therefore, models such as FEM models of electric machines, comprising many independent mesh elements, are suitable for implementation on these hardware platforms [35]. It is, however, required that the execution time of each processing unit be less than the simulation step size to leave time for data transfer.
B. CMP
A CMP, also called a multicore CPU, is a single die comprising multiple processing units (cores) that execute program instructions (Fig. 6) [90]. These cores have a shared memory and some shared cache and can run multiple instructions simultaneously, making them ideal for parallel programming. CMPs often achieve this parallelism through multithreading [91]. Unlike single-core CPUs, where the parallelism achieved by multithreading is limited, CMPs can exploit multithreading to achieve significantly higher performance. The program is first partitioned into independent parts, or threads, that can run simultaneously. CMPs then execute the threads in parallel over various cores. The performance is boosted further by exploiting the instruction-level parallelism among individual instructions within each thread. CMPs should have sufficient cache and memory, as multithreading results in only modest gains in performance or even slowdowns in their absence [92]. CMPs also support high clock rates and multiple pipelines that speed up the simulation of sequential applications [93]. CMPs can be classified as homogeneous and heterogeneous. In homogeneous CMPs, all cores are identical in functionality. In heterogeneous CMPs, cores differ in their characteristics, e.g., cache size, core type, and clock frequency [94]. Designers can optimize applications by wisely partitioning different workloads on different cores. CMPs offer plenty of computational power, memory, and coding flexibility by supporting user-friendly block diagram-based software packages like Simulink and MathWorks' Real-Time Workshop. This ease of coding and their relatively low cost make CMPs the traditional platform to simulate electric machines and drives [16], [26], [68], [95]-[98], using simulators like HYPERSIM, RTDS, and RT-LAB [99]-[101]. As expected, the simulation speed increases with the number of cores in the CMP [102]-[104].
Harakawa et al. [98] performed the real-time simulation of a complete PMSM drive on a CMP with three cores. The drive circuit was separated into two parts, and each part was simulated on a separate core using a different computational thread (multirate simulation). The simulation time steps were determined according to the dynamics of each part, and a speedup of ∼250× over offline simulation was achieved. Likewise, Asghari and Dinavahi [16] used multicore CPUs to achieve real-time simulation of a vector-controlled MEC model of an induction machine where each subsystem was simulated on a different core. The decoupling approach allowed using multirate time steps for each subsystem depending on its dynamics. In [104], a shipboard power system was simulated on a quad-core CPU. A multithreaded program was executed to solve the subsystems in a combined sequential-parallel manner on the four cores of the CPU. An unusually high speed gain of 20× was achieved over single-core CPUs.
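The multirate idea mentioned above, where each subsystem uses a time step matched to its own dynamics, can be sketched in Python as follows. The example is not the implementation of [16] or [98]; it assumes a hypothetical dc-machine-like system with a fast electrical subsystem and a slow mechanical subsystem, both integrated with forward Euler in a single thread, and all parameter values and names are illustrative.

```python
def simulate_multirate(t_end, h_fast, ratio, v=12.0):
    """Integrate a fast electrical and a slow mechanical subsystem at two rates.

    The mechanical step is `ratio` times the electrical step. The speed is
    held constant while the electrical subsystem takes its fast steps.
    """
    R, L = 1.0, 1.0e-3      # winding resistance (ohm) and inductance (H), assumed
    K = 0.1                 # torque / back-EMF constant, assumed
    J, B = 1.0e-3, 0.01     # inertia and viscous friction, assumed

    h_slow = ratio * h_fast
    n_slow = round(t_end / h_slow)
    current, speed = 0.0, 0.0
    fast_updates = 0

    for _slow in range(n_slow):
        # Fast subsystem: several small steps with the slow state frozen.
        for _fast in range(ratio):
            current += h_fast * (v - R * current - K * speed) / L
            fast_updates += 1
        # Slow subsystem: one large step using the latest current.
        speed += h_slow * (K * current - B * speed) / J

    return current, speed, fast_updates, n_slow


current_ss, speed_ss, n_fast, n_slow = simulate_multirate(1.0, 1.0e-4, 10)
```

In this sketch the slow subsystem is updated ten times less often than the fast one, so its share of the computation drops accordingly; on a CMP, the two loops could instead run on separate cores and exchange states once per slow step.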
The main disadvantages of CMPs are the high I/O latency and hence the relatively large computation time step. The PCI bus transaction time delay makes time steps of less than 1 μs hard to achieve with present-day CMPs. Therefore, simulating very high-frequency PWM machine drives, having hundreds of I/O points, is not practical on CMPs [105]. However, the I/O latency can be minimized using an FPGA-based card as the I/O interface [106].
C. Computer Cluster
A computer cluster is a group of computers bound together through local area networking. The motivation of using computer clusters arises from the need to provide the computational power in simulating models that cannot be accommodated by CMPs. The computer cluster system is scalable; it can be expanded, depending on the model complexity and the computational intensity, to achieve smaller time steps [107] . The architecture of a computer cluster is composed of one or more hosts (master node) and target computers (computational nodes), communication links between the computers, and various I/Os to connect the computers to an external device [101] . The master node basically schedules and controls the cluster, while the target computers behave as computational nodes. The computational nodes can run concurrently and, thereby, be used in the parallel execution of instructions.
Each computational node may comprise multiple CPUs. The CPUs belonging to different nodes communicate via an interconnection network such as InfiniBand or high-speed Ethernet links [108], [109]. The presence of these communication channels in a computer cluster induces high latency and, thus, leads to a relatively high communication time compared with CMPs. The programming model is also more complicated in computer clusters, as multithreading may have to be combined with the message passing interface (MPI) to achieve parallelism [110]. Computer clusters have been used in simulating ac machine drives [26], [109], [111]-[114], giving faster results than offline simulation. Real-time simulation of a grid-connected wind farm was performed on a computer cluster in [24] and [102], resulting in speed gains of 150× and 55×, respectively, over offline simulation. In [107], a power system was simulated using a cluster of five PCs, where a speed gain of 3.5× over a single PC was reported. Paquin et al. [108] proposed a multi-CPU-based cosimulation framework for shipboard power systems on a computer cluster. The framework was based on enabling integration and interoperability among the heterogeneous system models, each possibly simulated on a different computer. Pak and Dinavahi [26] performed the real-time simulation of a grid-connected wind turbine generator on a computer cluster of two nodes. Order-reduction techniques were applied to the overall model to satisfy the time constraint for real-time simulation. A computational time step of about 50 μs was achieved.
One challenge in the computer cluster scheme [115] is the inherent communication latency among the computers in each simulation time step. To achieve real-time performance, the required data must be transmitted among the computers in each time step, and the computers must be synchronized. Considering the different characteristics of the computers and/or the presence of asymmetrical computational workloads in the cluster, the real-time simulation must allow the slowest computational node to compute its solution as well as to transfer the data to the other nodes in less than the imposed time step.
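The constraint on the slowest node can be written as a simple check. The Python sketch below assumes that each node's per-step cost is the sum of its computation time and its data transfer time; the node timings and the function name are hypothetical values for illustration, not measurements from the cited work.

```python
def step_is_feasible(node_times, time_step):
    """Check whether the slowest node finishes computing and transferring in time.

    node_times: list of (compute_time, transfer_time) pairs in seconds, one per node.
    Returns (feasible, slowest_total), where slowest_total is the largest
    compute-plus-transfer time in the cluster.
    """
    slowest_total = max(compute + transfer for compute, transfer in node_times)
    return slowest_total < time_step, slowest_total


# Assumed two-node cluster with an asymmetrical workload.
nodes = [(20e-6, 10e-6), (30e-6, 15e-6)]
ok_at_50us, slowest = step_is_feasible(nodes, 50e-6)
ok_at_40us, _ = step_is_feasible(nodes, 40e-6)
```

Here the second node sets the limit: a 50-μs step is feasible, while a 40-μs step would overrun even though the first node alone could meet it.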
D. FPGA
The generic architecture of an FPGA consists of three programmable components: logic blocks, input/output blocks, and interconnection resources [116] . The logic blocks comprise logic gates, multiplexers, and lookup tables, and they are capable of performing different combinational and sequential logic functions. The input/output blocks provide the interface between the external pins and the internal logic blocks and can be dynamically configured for input, output, or bidirectional operations [117] . The interconnection resources are typically electrically programmable switches. They set up the routing paths to connect the logic blocks and the input/output blocks [116] , [118] . Modern FPGAs also have specialized blocks such as multipliers and DSP slices to speed up calculations [119] . FPGAs have very low I/O latencies as the system I/Os are directly connected with the FPGA without any PCI-express (PCIe) bus. This ultralow latency allows FPGAs to perform real-time simulation with time step sizes in the range of nanoseconds.
Based on the numerical representation, the FPGA architecture can be categorized into fixed-point [85] and floating-point configurations [120] . According to IEEE Standard 754-2008, the floating-point format can be classified into binary32 (32 b), binary64 (64 b), and binary128 (128 b) formats. The higher precision formats provide greater numerical precision but require more computational time and hardware resources. In an offline simulation environment, the floating-point representation provides more accurate results, as it can represent a wider range of numbers, but it requires more hardware resources and can only achieve time step sizes in the range of microseconds. The fixed-point format has the advantage of realizing very small time step sizes, e.g., in the range of nanoseconds [31] , [33] . The choice between these two numerical formats is, therefore, a compromise among simulation speed, fidelity, and hardware resource utilization.
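The fidelity side of this compromise can be illustrated by quantizing a value to a signed fixed-point word and observing how the error shrinks as fractional bits are added. The sketch below is written in Python for illustration only and does not represent any FPGA toolchain; the helper names, the 32-bit word length, and the chosen fractional bit counts are assumptions.

```python
def to_fixed(x, frac_bits, word_bits=32):
    """Quantize x to a signed fixed-point integer with `frac_bits`
    fractional bits, saturating at the limits of a `word_bits` word."""
    scale = 1 << frac_bits
    lowest = -(1 << (word_bits - 1))
    highest = (1 << (word_bits - 1)) - 1
    return max(lowest, min(highest, round(x * scale)))


def from_fixed(q, frac_bits):
    """Convert a fixed-point integer back to a floating-point value."""
    return q / (1 << frac_bits)


value = 0.1  # not exactly representable in any binary format
for frac_bits in (8, 16, 24):  # hypothetical fractional word lengths
    q = to_fixed(value, frac_bits)
    error = abs(from_fixed(q, frac_bits) - value)
    print(frac_bits, q, error)
```

As long as the value does not saturate, the rounding error is bounded by half of the least significant bit, i.e., 2^-(frac_bits + 1), so each additional fractional bit halves the worst-case quantization error at the cost of a wider data path.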
The implementation approach can be parallel [121] , [122] , or pipelined [10] , [33] , [123] . In the former, one data item per hardware module is computed in each time step, while in the latter, multiple data items per hardware module are computed in each time step. The parallel implementation uses more hardware resources but offers the advantages of simplicity and smaller time steps. The pipelined implementation, on the other hand, requires fewer hardware resources but is more complex and requires longer time steps. The pipelined implementation [122] , [124] is used when a model, such as a power system with a large number of electric machines, is too large to be efficiently accommodated in a single FPGA using the parallelization scheme. Table I compares floating- and fixed-point arithmetic and parallel and pipelined realization schemes.
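The resource and time-step tradeoff between the two schemes can be sketched with a simplified cycle-count model. The model assumes that each data item needs a fixed number of clock cycles to pass through a hardware module and that a pipelined module accepts one new item per clock cycle; the item count, module depth, and 10 ns clock period are hypothetical and are not taken from the cited implementations.

```python
def parallel_cost(n_items, stage_depth):
    """One hardware module per data item: all items finish together."""
    return {"modules": n_items, "cycles": stage_depth}


def pipelined_cost(n_items, stage_depth):
    """One shared module: the first item takes `stage_depth` cycles and
    each remaining item completes one cycle after the previous one."""
    return {"modules": 1, "cycles": stage_depth + n_items - 1}


CLOCK_PERIOD_NS = 10.0  # hypothetical 100 MHz FPGA clock

for name, cost in (("parallel", parallel_cost(8, 5)),
                   ("pipelined", pipelined_cost(8, 5))):
    step_ns = cost["cycles"] * CLOCK_PERIOD_NS
    print(name, cost["modules"], "module(s),", step_ns, "ns minimum time step")
```

Under these assumptions, the parallel scheme replicates the module for every data item and finishes in the module latency alone, whereas the pipelined scheme reuses one module and pays for it with additional clock cycles per time step.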
Machine models can be emulated on FPGAs using textual programming languages (TPLs) such as Verilog and very high-speed integrated circuit hardware description language (VHDL) or using a schematic method that is based on a block diagram-based environment such as Simulink [33] . In the schematic method, specific block sets such as Altera DSP Builder and Xilinx System Generator within MATLAB/Simulink generate the hardware description language (HDL) code automatically, which makes it user friendly. Hence, the schematic method facilitates design by nonexpert FPGA users [125] without any HDL coding. The schematic method is easy to debug, but its applications are limited to simple machine models. Also, it consumes more hardware resources and results in slower simulation than the TPL method [126] . The TPL method, on the other hand, is more flexible but also more challenging to code and debug. With careful design, it can be optimized to consume fewer hardware resources.
The block diagram of the implementation of a machine model on an FPGA is shown in Fig. 7 . The machine is first modeled as a set of ODEs. To solve these ODEs using numerical integration methods, such as Runge-Kutta, the schematic approach starts by implementing the machine model in model-based tools, such as MATLAB/Simulink. The implemented model is then discretized to get a digital model using software packages, such as Xilinx ISE and Altera Quartus II [17] , [127] . At this stage, the HDL code representing the model is generated automatically. The main control and interface files along with I/O pin assignments are formed. The produced binary file is then mapped onto the FPGA board. The same procedure can also be implemented by manually coding the entire algorithm and FPGA architecture using a TPL (e.g., VHDL or Verilog) and then mapping the binary file onto the FPGA board. The developed code is typically validated using HDL simulators such as ModelSim and ISim before being downloaded onto the FPGA board [128] - [132] . This HDL simulation is necessary to verify whether the developed machine model is accurate. If the simulation results show a significant discrepancy with the actual values, then the model description is modified and simulated again until satisfactory results are obtained. Hence, the risk of significant changes in the code for FPGA implementation is greatly reduced [128] .
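The first two steps of this flow, i.e., describing the machine as a set of ODEs and discretizing them with a fixed-step integration method, can be sketched in software before any hardware is involved. The example below uses a two-state dc motor model and the classical fourth-order Runge-Kutta method; the choice of machine, the parameter values, the step size, and all function names are assumptions made for illustration and do not reproduce the models of the cited works.

```python
def dc_motor_derivs(state, v, R=1.0, L=0.5, Ke=0.01, Kt=0.01, J=0.01, B=0.1):
    """State derivatives of a dc motor (hypothetical parameters, SI units).

    state = (armature current i, rotor speed w); v is the armature voltage.
    """
    i, w = state
    di = (v - R * i - Ke * w) / L   # armature circuit equation
    dw = (Kt * i - B * w) / J       # mechanical equation, no load torque
    return (di, dw)


def rk4_step(f, state, u, dt):
    """Advance `state` by one fixed step `dt` using classical RK4."""
    k1 = f(state, u)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)), u)
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)), u)
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)), u)
    return tuple(s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))


def simulate(v=1.0, dt=1e-3, t_end=10.0):
    """Integrate from rest with a constant voltage; return the final state."""
    state = (0.0, 0.0)
    for _ in range(round(t_end / dt)):
        state = rk4_step(dc_motor_derivs, state, v, dt)
    return state


i_final, w_final = simulate()
print("current:", i_final, "A, speed:", w_final, "rad/s")
```

Each pass through `rk4_step` corresponds to the fixed amount of arithmetic that must complete within one simulation time step, which is the quantity the hardware implementation has to bound.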
FPGAs are now used widely in the real-time simulation of complex systems, such as power converters [85] , [86] , [121] , [133] - [139] , electric machines and drives [10] , [22] , [29] - [34] , [67] , [85] , [140] - [154] , and power systems [118] , [120] , [122] - [124] , [155] , [156] . Significant simulation speed gains over CPU-based software simulation have been reported, such as 54× for d-q models of electric machines in [29] , 6000× for FEM models of electric machines in [157] , 120× for d-q models of drives in [141] , and 50× for power systems in [156] .
In the real-time simulation of a power system, the computational time step [120] is mostly dictated by the computations of ac machines. Time steps in the range of microseconds [33] and nanoseconds [145] for the machines have been achieved using floating-point and fixed-point arithmetic, respectively. The impact of machine nonlinearities, such as saturation, on the time step was assessed by Herrera et al. [145] , who reported computational time steps of 200 and 650 ns for unsaturated and saturated cases, respectively. For cases in which the logic resources of a single FPGA are insufficient to simulate large-scale power systems, Chen and Dinavahi [122] proposed a modular architecture based on multiple FPGAs. Compared with machines, drive systems require a smaller step size on the order of nanoseconds, given their high-frequency content. Time steps up to the range of hundreds of nanoseconds have been achieved in [85] , [86] , [122] , and [135] - [137] .
FPGAs have relatively limited hardware resources. This limits their floating-point computational capacity [117] , [158] . As the model order increases, this hardware resource limitation becomes more pronounced. To remedy this, multiple interconnected FPGAs [122] can be used or the pipelined realization [33] can be adopted. However, the pipelined implementation requires complex VHDL coding and an in-depth knowledge of the FPGA architecture.
E. GPU
The GPU is a massively parallel multicore processor with unique floating-point performance and programmability [159] . Unlike CMPs, GPUs are designed specifically for high-speed data processing by allocating a vast amount of resources for processing rather than for caching and flow control [35] . This feature allows GPUs to deliver much higher performance than CPUs at a lower cost and lower power consumption for suitable computations [160] . GPUs have architectural differences based on their manufacturers. NVIDIA and AMD are the two leading manufacturers of GPUs [18] , [161] . GPUs use multiprocessors for computational purposes with each multiprocessor composed of many cores that share registers, caches, and dedicated memory as shown in Fig. 8 . The computing element in a GPU is called a thread, and each thread corresponds to one data element. Groups of threads form a block and groups of blocks form a grid. GPU functions are called kernels. Whenever a kernel is invoked, these blocks and grids are created. Individual identities are assigned to each thread and block by the GPU coding language. Threads recognize the data to work on and the instruction to execute by inspecting their own thread and block identities. All the threads inside a grid execute identical instructions but on different data sets. This is known as the single instruction, multiple data (SIMD) format. Threads in each block have access to the shared memory in the multiprocessor and to the GPU's global memory [37] , [160] . Fig. 9 represents the implementation of a notional multimachine wind farm on a GPU.
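The way a thread locates its data element from its own thread and block identities can be illustrated with a serial emulation of a kernel launch. The sketch below is plain Python rather than GPU code, so the threads run one after another instead of concurrently; the index expression mirrors the common CUDA idiom of multiplying the block index by the block size and adding the thread index, while the function names, block size, and data are hypothetical.

```python
def scale_kernel(block_id, thread_id, block_dim, data, out, gain):
    """Kernel body run by every thread: same instruction, different data."""
    idx = block_id * block_dim + thread_id  # global data element index
    if idx < len(data):                     # surplus threads do nothing
        out[idx] = gain * data[idx]


def launch(kernel, grid_dim, block_dim, *args):
    """Serially emulate a launch of `grid_dim` blocks of `block_dim` threads."""
    for block_id in range(grid_dim):
        for thread_id in range(block_dim):
            kernel(block_id, thread_id, block_dim, *args)


data = [float(k) for k in range(10)]
out = [0.0] * len(data)

block_dim = 4                                         # threads per block
grid_dim = (len(data) + block_dim - 1) // block_dim   # blocks needed to cover the data

launch(scale_kernel, grid_dim, block_dim, data, out, 2.0)
print(out)
```

Because the grid is sized by rounding up, the last block contains threads with no data element to process, which is why the kernel guards the index before writing.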
A GPU always works in conjunction with a host CPU. This CPU can either serve as a coprocessor or merely control the simulation flow [160] , [162] . When serving as a coprocessor [160] , the serial parts of an algorithm are executed on the CPU, while the parallel parts run on the GPU. The GPU interacts with the CPU via a PCIe bus. Data are first transferred from the CPU to the GPU's global memory via the bus. The transfer speed is limited by the PCIe bus capacity, which induces latency and, thus, leads to a higher setup time compared with other platforms [163] . The memory of a GPU [164] can be categorized into global memory, shared memory, and local memory. All data processed by the GPU are copied into the global memory before being transferred to the CPU. However, the global memory is not on-chip, and hence accessing it induces latency. This leads to a relatively higher communication time. This latency can be reduced by exploiting the much faster shared memory and accessing the global memory less frequently. The key to GPU performance improvement, therefore, lies in designing algorithms to optimize the use of these memory levels [35] , [37] .
GPUs are programmed using general-purpose languages such as the C-based Compute Unified Device Architecture (CUDA) [165] - [168] and the Open Computing Language (OpenCL) [169] , [170] . CUDA has gained popularity because of its user-friendly nature and its relatively straightforward interfacing with MATLAB and other simulators with C-language block support [35] .
The parallelism feature of GPUs is making them a popular option in the numerical field analysis of electric machines [23] , [35] , [170] , simulation of electric machines [171] - [173] , transient analysis of power converters [174] , power-flow computations [175] , and power system transient simulations [37] , [160] , [176] , [177] . Simulation speed gains over CPU-based software simulation have been reported for different applications, such as 420× for transient stability simulation of power systems [160] , 17× for electric machines [173] , 10× for transient analysis of power converters [174] , and about 40× for electromagnetic transient (EMT) simulation of power systems [178] , to name a few. However, unlike FPGAs, it is difficult for GPUs to achieve a time step in the range of nanoseconds because of their relatively high setup and communication times.
Zhou and Dinavahi [37] exploited the inherent parallelism in GPUs to provide the computational power needed for an electromagnetic transient simulation of large-scale power systems. They developed a chain of many parallel unified modules of power system elements by considering GPU architecture and programming language (e.g., CUDA) thread hierarchy to make every core in the GPU work efficiently. The computational efficiency of the proposed algorithm on GPU over CPU-based simulation was proved by expanding the IEEE 39 bus system several folds. The inherent parallelism feature of GPUs is also exploited in state estimation of large-scale power systems [162] , where it yielded 38× faster simulation over CPU-based software simulation. Adzima et al. [35] investigated the applicability of GPUs in the numerical field analysis of electric machines. They carried out a 2-D analysis of an inductor using the boundary-element method algorithm, achieving a moderate speed gain over software simulation.
IV. PERFORMANCE EVALUATION OF DIFFERENT HARDWARE PLATFORMS
The choice of hardware platforms for real-time simulation [28] is based on their performance, programmability, and application constraints (e.g., timing or model size). The performance [163] depends on various factors, such as architectural design and programming style of the platform used. The machine model by itself does not primarily govern the choice of the platform to be used. Rather, the system size and requirements determine this choice. In real-time applications, the platform performance can be assessed by the computational time step and accuracy, i.e., to what extent the simulated model resembles the real system.
In [33] , real-time simulation of the d-q models of an induction motor, a synchronous generator, a line-started PMSM, and a dc motor has been carried out on an FPGA. As expected, given the model size, the dc motor had the smallest time step size and consumed the least hardware resources, while the synchronous generator had the largest. In [10] and [33] , the d-q and MEC models of an induction motor have been simulated, in real time, on an FPGA. As expected, the MEC model gave more accurate results, but required a 228 times larger step size and more hardware resources than the d-q model.
FPGAs and GPUs both bring about a significant gain in simulation speed compared with CPU-based software simulation as given in Table II . However, a GPU is observed to outperform a CMP only for large models involving tens of thousands of ODEs. For smaller models, CMPs are more computationally efficient as the GPU's gain in raw processing power, brought about by the manifold increase in the number of cores, is offset by its relatively high setup and communication times. Such results are found for power systems [37] , [177] , electromagnetic field analysis [167] , [173] , [180] , power converter simulation [174] , and biophysical systems [166] , [181] . For example, in a simulation of a power system containing 10 generators with 39 buses, the CPU with four cores gave faster results only when the power system scale was less than 1.7, as shown in Fig. 10 [177] . Here, the system scale refers to the original power system being duplicated with the scale factor (e.g., a system scale of two means that the original model order is doubled). The GPU achieved a maximum speedup of 10× at a system scale of 256. A hybrid configuration (CMP + GPU) is also found in [161] and [183] - [185] . The power system in [177] for multiple scales was simulated on a GPU, a quad-core CPU, and the hybrid GPU + CMP platforms in [160] . It was observed that the CMP-only implementation ran faster than other platforms when the power system scale was less than two. For the system scales between 2 and 16, the hybrid configuration gave faster results. However, for very large scales (>16), the speed gain achieved by the GPU became more pronounced.
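The existence of a crossover model size can be reproduced with a deliberately simple cost model in which the GPU pays a fixed setup and communication overhead but has a lower cost per model equation than the CMP. This is a sketch under stated assumptions only: the linear cost model, the function names, and the numerical values are hypothetical and are not fitted to the results of the cited works.

```python
def gpu_time(n, setup, per_item):
    """Hypothetical GPU execution time: fixed overhead plus per-item cost."""
    return setup + n * per_item


def cmp_time(n, per_item):
    """Hypothetical CMP execution time: per-item cost only."""
    return n * per_item


def crossover_size(setup, cmp_per_item, gpu_per_item):
    """Smallest model size at which the GPU is strictly faster, or None
    if the GPU has no per-item advantage to amortize its overhead."""
    if gpu_per_item >= cmp_per_item:
        return None
    # GPU is faster when setup + n*g < n*c, i.e., n > setup / (c - g).
    return int(setup / (cmp_per_item - gpu_per_item)) + 1


# Hypothetical costs in arbitrary time units.
SETUP, CMP_COST, GPU_COST = 900.0, 1.0, 0.25

n_star = crossover_size(SETUP, CMP_COST, GPU_COST)
print("GPU becomes faster from model size", n_star)
for n in (100, n_star, 10 * n_star):
    print(n, cmp_time(n, CMP_COST), gpu_time(n, SETUP, GPU_COST))
```

Below the crossover size, the fixed overhead dominates and the CMP finishes first; above it, the lower per-item cost of the GPU dominates, and the gap widens as the model grows.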
As observed in Table II , FPGAs achieve a speedup of three orders of magnitude, while GPUs achieve a speedup of one order of magnitude [182] for different types of machines. In fact, the acceleration gain depends on a number of parameters such as the model order, computational capability of the hardware, and simulation time step. In [185] , Gilbert's algorithm for support vector machine training was implemented on both an FPGA and a GPU. It was observed that the FPGA had a superior gain over the GPU (∼10×) for different training data set sizes, resulting from the relatively high data transfer time of the GPU. However, in the pipelined implementation of an FPGA, this speed gain worsens significantly as the data dimensionality increases. This is typically the case for very large systems, which cannot be accommodated in a single FPGA, as shown in Fig. 11 . Table III summarizes the execution times for dynamic systems of various sizes implemented on the FPGA, GPU, and CMP in [36] and [181] . The results show that the FPGA implementation led to an average of 4× faster simulation over both GPU and CMP. GPU programs, in general, tend to take far less development time than FPGA programs. This is because programming in GPU is done using user-friendly C-like languages such as CUDA, while programming in FPGA is done using complex and verbose languages such as VHDL. However, bitwise operations can be quickly executed in VHDL. Hence, FPGAs are more efficient when simulating applications involving plenty of bitwise calculations. GPUs are also more efficient for floating-point operations as FPGAs require more hardware resources and deep pipelining [163] . However, for double-precision computation, CMPs outperform GPUs in terms of simulation speed [35] . In applications requiring frequent access to memory, GPUs are a less attractive option due to their relatively high communication time. In such cases, FPGAs can be used by taking advantage of their data flow streaming capability [163] .
Table IV summarizes the main features of the hardware platforms used in real-time simulation. FPGAs have been largely used in real-time simulation of electric machines [29] , [85] , [144] , given their high clock speed, reconfigurable scheme, parallelism, and low bus latency. However, very high model orders cannot be efficiently implemented on FPGAs due to their limited hardware resources. To overcome this obstacle, pipelined realization schemes [33] and multiple-FPGA approaches [122] have been proposed, but they can be costly. As the model order increases, GPUs, computer clusters, and CMPs become prominent due to their scalability. On these platforms, large-scale power systems [26] , [105] , [175] have been efficiently implemented with time steps in the range of microseconds. Moreover, latency and synchronization among computational nodes are a greater challenge for computer clusters than for GPUs and CMPs [104] , [177] . As the model order increases further [37] , GPUs outperform CMPs in execution time. The large I/O latency is the main drawback of both platforms. From a machine modeling perspective, lumped parameter models have been implemented on both FPGA and CPU-based simulators. Since the lumped parameter models are not computationally intensive, GPUs are not considered to be the first-choice simulator due to their high setup and communication times. For higher fidelity models, GPUs and FPGAs have been primarily used to provide the necessary computational power due to their high number of computing units. In such cases, the large number of computations involved is a burden on simulation time for CPU-based simulators.
V. CONCLUSION
In this paper, the hardware platforms common in real-time simulation of electric machines have been reviewed. Real-time simulation allows an engineer to realistically test newly designed controllers and machine drives over a wide range of conditions, in a nondestructive environment. The study objective determines the choice of appropriate machine models. d-q models are widely used in real-time simulation given their simplified nature. They result in faster execution times but lack model fidelity. Physics-based models, i.e., FEM and MEC models, are more accurate but are computationally intensive. CMPs and computer clusters have been the classical choice for real-time simulation involving d-q models. However, their high latencies prevent them from achieving computation time step sizes in the range of nanoseconds, thus effectively disqualifying them from simulating high-frequency PWM-based drives. FPGAs emerge as an alternative; their ultralow latency allows them to achieve nanosecond-range computation time steps. However, because FPGAs suffer from limited hardware resources, GPUs become prominent in the large-scale numerical field analysis of electric machines. Coding for a GPU is also more time efficient than coding for an FPGA. However, the relatively high setup time makes GPUs less attractive for simulating models of small order. GPUs and FPGAs are shown to achieve simulation speed gains of one and three orders of magnitude, respectively, over CMPs. Given the lack of a unified test bed simulated on all platforms to quantify the platform performance in terms of simulation fidelity and speed, our future study will be directed along this perspective.
