ABSTRACT Electromagnetic transients (EMT) simulation is the most accurate and intensive computation for power systems. Past research has shown the potential of accelerating such simulations using graphics processing units (GPUs). In this paper, an efficient GPU-based parallel EMT simulator is designed. Threadoriented model transformations are first proposed for the electrical and control systems. Following the transformations, the electrical system is represented by connected networks of massive primitive electrical elements, the computations of which can be constructed as massive fused multiply-add operations and solutions to a linear equation. The control systems are represented by a layered directed acyclic graph with primitive control elements that can be dealt with using single-instruction-multiple-threads groups. Finally, code automation tools are designed to form the GPU kernels. Compared with past work, the proposed model transformations improve the degree of parallelism. Most importantly, the code automation tools improve computational efficiency by substantially reducing addressing and memory access, and render the implementation of the algorithm more general and convenient. Test systems of different sizes were created by connecting multiple IEEE 33-bus distribution systems and adding distributed generators. Simulations were performed on NVIDIA's K20x and P100 cards. The results indicate that the proposed method significantly accelerates EMT simulations compared with a CPU-based program. Real-time performance was also achieved under certain conditions. INDEX TERMS Electromagnetic transients simulation, EMTP, GPU, parallel computing, power system.
I. INTRODUCTION
Electromagnetic transients (EMT) simulation [1] is the most accurate tool to describe the fast dynamics of power systems. The widely used EMT simulation algorithm, also known as the EMTP-type program, has been rapidly developed in last few decades, and has been applied to several simulation software, e.g., ATP-EMTP [2] and PSCAD [3] . However, with the rapid development of smart grid technologies, complicated devices such as power electronic converters (PECs) and distributed energy resources (DERs), advanced controls are integrated into power systems. Such integration converts the grid into a large-scale, strongly coupled system. Consequently, EMT simulations for such systems are time consuming. This restricts fast simulation-based applications, such as real-time control strategy validations and closed-loop simulations. A considerable amount of research has been invested in building parallel computing-based EMTP-type programs using state-of-the-art multi-core processors [4] , PC clusters [5] , and other heterogeneous platforms. These parallel simulations are implemented mainly by network partitionings [6] - [8] that simulate different subsystems on computing devices. However, the above parallel techniques are coarse-grained ones, with limited degrees of parallelism inside each partition.
In recent years, owing to the insurmountable conflicts between increase in clock speed and heat removal [9] , it has become challenging to maintain enhancements in the performance of CPUs at pace with that predicted by Moore's law. Growth in computing capability relies on many-core heterogeneous devices, such as graphics processing units (GPUs). The GPU was originally designed for image processing. Since NVIDIA developed the Compute Unified Device Architecture (CUDA) [10] - [12] , GPUs have been used for general-purpose scientific computing in many areas, such as biology, geology, and deep learning [13] , [14] .
In the area of power system simulations, the GPU was first used to accelerate multi-scenario power flow-based contingency analysis [15] . Since then, GPUs have been used to improve the performance of many power system applications, such as the parallel power flow solver [16] , parallel transient stability analysis [17] , and parallel EMTP-type simulators [18] , [19] .
Earlier GPU-based parallel EMTP-type simulators simply regarded the GPU as an efficient linear algebra equation solver for use during the network solution process while the rest of the computations were still performed on the CPU. However, frequent communication between the CPU and the GPU is required, which limits overall performance. Researchers thus began developing full GPU-based EMTPtype simulators where all computations are performed on the GPU, with minimal data exchange between it and the CPU. In [18] , different GPU kernels were designed for various electrical components, such as linear passive elements, the universal transmission line model, and the universal machine model. A general offline EMTP-type program on a GPU has been developed and verified as highly efficient without compromising accuracy, but there is still a lack of GPU-based control system simulations. In our preliminary work [20] , a full GPU-based EMTP-type simulator was designed by considering the simulation of both electrical and large-scale control systems. A layered directed acyclic graph based computational model was applied to process simulations for control systems. The algorithm in [20] has been applied to wind farm simulations and yielded considerable improvement in computational speed [21] .
The above work has made significant contributions to the design of a full GPU-based EMTP-type simulator. However, two major drawbacks persist. First, past work has not improved approaches to modeling power systems to render them more compatible with GPU computing, which may lead to load imbalance and thread heterogeneity while solving for electrical systems. Second, previously designed kernels on GPUs [20] are inadequate as they are not applicable to all kinds of complex control systems. Moreover, multi-layer addressing and a large number of memory accesses are introduced, and computational efficiency is thus severely limited by memory bandwidth.
In this paper, a full GPU-based thread-oriented parallel EMTP-type simulator is designed, and makes two main contributions:
• A thread-oriented normalized transformation is proposed to represent the electrical system as composed of a network with only primitive electrical elements, the calculations of which are basic fused multiply-add (FMA) operations, with the same instructions and equal amounts of computation for each threads. Therefore, such transformation can further improve the degree of parallelism (DOP) of the electrical system without increasing computations.
• Code automation tools are designed to generate the kernels for both the electrical system and the control system according to a case study at runtime. Such tools reduce addressing and memory access significantly, and obtain higher kernel efficiency.
The proposed simulator was verified to induce significant accelerations in both traditional power systems and AC-DC hybrid grid simulations, and can be used as a real-time simulator for power systems at certain scales.
The remainder of this paper is organized as follows: In Section II, the challenges of designing a high-performance GPU-based EMTP-type simulator are discussed. The proposed thread-oriented parallel EMTP-type simulator is described in Section III. A case study is presented and discussed in Section IV, and the conclusions of this work are given in Section V.
II. CHALLENGES OF FULL GPU-BASED EMTP-TYPE SIMULATOR

A. GENERAL-PURPOSE COMPUTING ON GPUS
A CUDA-enabled GPU is a many-core processor composed of arithmetic logic units (ALUs), shared memory, and control units [10] . A parallel program executed on a GPU is referred to as a kernel, and usually launches a large number of threads. To better use computing resources on a GPU, a multi-level execution model has been widely used to schedule computational tasks on it, as shown in Fig. 1 .
A thread grid launched by a kernel is divided into thread blocks, each containing no more than 1,024 threads. Blocks are distributed to different streaming multiprocessors (SMs). It should be noted that only threads within a block can exchange data through the shared memory. Blocks on an SM are split into thread warps, and each warp contains no more than 32 threads. Warp schedulers inside an SM allocate each thread warp to a core array. Threads inside a warp execute in single-instruction-multiple-threads (SIMT) pattern. If the instruction diverges inside a warp, threads corresponding to different instructions are executed serially.
As a result of the above, the GPU is designed for dataintensive, fine-grained parallel computing tasks. The principles of designing high-performance parallel algorithms on GPUs can be found in [10] and [11] . The following are some important principles that we followed in the design of the VOLUME 6, 2018 FIGURE 1. Multi-level parallelism on GPU.
full GPU-based EMTP-type simulator: 1) eliminating divergences inside a warp to minimize latency; 2) avoiding conditional statements and memory access conflicts to maximize instruction and memory throughputs; and 3) transforming computations into FMA operations, i.e., z = ax + y, which is the most efficient operation on the GPU.
B. CHALLENGES OF DESIGNING GPU-BASED EMTP-TYPE PROGRAM
In power system simulations, the EMT simulation model for electrical components and controls can be represented as a set of differential equations. Therefore, the EMT simulation performs numerical integration on the differential equations. In general commercial software, there are two kinds of EMT modeling and simulation frameworks, i.e., state space analysis and nodal analysis. In general, the nodal analysis framework is more efficient as the topology of the power system does not change frequently and the factorization process of the nodal matrix can be eliminated in many cases [20] .
The widely used nodal analysis framework for power system transient simulations, also known as the EMTP-type program, carries out the numerical integration of all models in the power system. As shown in Fig. 2 , such simulations consist of three serial steps [22] in each integration step: 1) Solve the control systems to obtain inputs for the electrical components. 2) Update the status of the electrical components using numerical integration rules. 3) Solve network voltage equations. Typically, step 3) can be implemented using various algebraic solvers on the GPU, such as the GLU [23] and cuSOLVER [10] . Therefore, it is not the main issue examined in this paper. In summary, to develop a full GPU-based EMTP-type simulator, parallel implementations of steps 1) and 2) should be carefully discussed. 
1) REDUCING THREAD HETEROGENEITY IN ELECTRICAL SYSTEM
In EMTP-type programs, step 2) calculates the electrical components. By applying numerical integration rules to the differential equations of an electrical component, discrete Norton equivalent circuits are obtained [22] . Typically, an mport electrical component is modeled by
where t is time, I, U ∈ R m represent the port's current and voltage vectors, respectively, G ∈ R m×m represents the Norton equivalent conductance matrix, and I ne ∈ R m represents the Norton equivalent current vector, which is calculated by
where P, Q ∈ R m×m denote the coefficient matrices, and C ∈ R m reflects the effect of control inputs on the electrical component.
A straightforward way to parallelize (1) and (2) on a GPU is to launch m threads, each of which performs the following computation for one element:
where i ne,k and i k are the k th elements of I ne and I, respectively. This approach was used in our prior work [20] . A similar approach has been reported in [18] , where specific kernels were designed for different electrical components. Driven by (3), the computation task assigned to each thread changes with the number of ports and type of corresponding component. Thus, heterogeneous threads are required and launched on the GPU. If threads within a warp need to execute different instructions, they are executed serially, resulting in a reduction in the DOP. This can be partially fixed by distributing threads associated with different components to different warps. However, loading imbalance among threads within distinct warps still exists: 1) the undesirable wait time owing to unbalanced loading can be considerable; and 2) thread-core mapping depends on the number of each type of component, which varies by case. As a result, the performance of the GPU-based simulation also varies with case.
Thus, mitigating the heterogeneity of threads is key to high-performance electrical networks simulations on GPUs.
2) EFFICIENT KERNEL DESIGN FOR COMPLICATED CONTROL SYSTEM
In general, a control system can be represented by a block diagram where transfer functions and nonlinear blocks are connected in certain topologies. To derive the final control signals, a separate thread on the GPU can be assigned to process each control component. Each thread calculates the outputs of the corresponding control component according to its inputs. However, for control components connected in series, the outputs of upstream components are used as inputs for immediately downstream ones. As a result, threads associated with the downstream components cannot perform calculations until the upstream component is solved. As shown in Fig. 3 (a) , operations 4 and 5 cannot be processed until they receive the results of operations 1, 2, and 3. This results in undesirable wait times and communication among threads on the GPU [20] .
Thus, in our previous study, a competition mechanism was designed to manage communication and synchronization among threads [19] , as shown in Fig. 3 (b) . However, competition among threads relies on frequent memory read operations and conditional judgments through atomic functions, which lead to inefficient memory access and reduced DOP. For a power system with thousands of control components, such as a wind farm, this causes bottlenecks that degrade the efficiency and scalability of GPU-based simulations.
In [20] , we subsequently proposed the LDAG computational model for GPU-based control system simulation. The model represents the complex control system as a layered directed acyclic graph (LDAG) that can be handled by layered and grouped SIMT groups on the GPU. The LDAG model removes the random competition mechanism among threads, using a while loop to control the layer iterator. However, such a kernel design has two drawbacks that restrict its utilization.
• If new primitive control elements (e.g., user-defined models) are used, the kernel should be redesigned to include new instructions. Similarly, the kernel design for electrical system faces such challenges.
• As the layer is controlled by the while loop, layer information can only be regarded as input data, which significantly increases addressing and memory access time. Therefore, efficient kernel design for the EMT simulation should avoid thread competition, excessive addressing and memory access.
III. THREAD-ORIENTED PARALLEL EMTP-TYPE SIMULATOR ON GPU
As shown in Fig. 4 , the framework of the thread-oriented parallel EMTP-type simulator features the following three steps:
• To obtain GPU-friendly computational models, threadoriented transformations are performed on the electrical and control systems. Specifically, the electrical components are represented by a set of primitive electrical elements and the control system is modeled as a layered DAG of primitive control instructions, as in [20] .
• Thread structures for both the electrical and the control computational models are determined. For the electrical system simulation, all computations can be dealt with by an SIMT group for FMA operations and a linear equations solver. For the control system, computations of the LDAG model can be processed by SIMT groups with synchronization between layers. 
FIGURE 4.
Framework of the thread-oriented parallel EMTP-type simulator.
• To build efficient simulation kernels on the GPU, automatic code generation tools are designed for both the electrical system and the control system. For a specific power system, the kernel code is automatically generated and compiled once before the simulation.
A. ELECTRICAL SYSTEM SIMULATION 1) THREAD-ORIENTED NORMALIZED TRANSFORMATION AND COMPUTING FOR ELECTRICAL COMPONENTS
As mentioned in Section 2.2, before applying numerical integration rules to form the Norton equivalent circuit in (1), the m-port electrical component was originally modeled by a set of differential equations, i.e.,
where x ∈ R n is the state vector, u ∈ R m represents the input vector, y ∈ R m represents the output vector, and A ∈ R n×n , B ∈ R n×m , C ∈ R m×n , and D ∈ R m×m are coefficient matrices. To generate homogeneous threads for the parallel computing of (4), an expansion of the state space is adopted, and can be described by a linear transformation:
where z ∈ R p , (p n) is the expanded state vector, v ∈ R q , (q m) represents a vector of expanded input variables, and M ∈ R n×p , K ∈ R p×q and N ∈ R m×p are linear transformation matrices. Then, the differential equation model of the electrical component in (4) is transferred to the expanded one given by (6):
Then, threads can be created for the integration of z i ∈ z (i = 1, · · · , p). The coefficient matrices have the following features: 1) Each row of matrix K contains only one nonzero element. 2) Matrix N contains only 0, 1, and -1. 3) p and q should be minimized under constraints 1) and 2). The benefit of the above computational model transformation is twofold.
(i) It uniformly represents the electrical components by primitive elements. As only one nonzero element exists in each row of K, and the dynamic of a state variable z i depends only on an input variable v j . There are several types of primitive electrical elements, as listed in Table I , whose dynamics can also be described by a one-dimension differential equation with one input variable. Thus, following the transformation, the original model of an m-port electrical component is reshaped into a composition of models of primitive electrical elements.
In Table 1 , branch current i b and Norton equivalent current i ne are calculated as
(ii) Homogeneous threads are created for primitive electrical elements. Using the computational model in (6) , an m-port electrical component is represented by p primitive electrical elements. For each electrical element, a thread on the GPU can be created to execute a two-step computation task repetitively.
• Step2.1: The Norton equivalent current of a primitive electrical element is calculated. Using the trapezoidal rule, the integration of the state equation of a primitive element can be transformed into a Norton equivalent equation. Then, the Norton equivalent current for the electrical elements is derived as given at the bottom of Table 1 . With the given node voltages from the previous step, the Norton equivalent current of the different electrical element can be calculated independently by the thread.
• Step2.2: The node-injected currents are updated by all threads together. As N only contains 1, 0, and -1, the matrix vector products for computing y, which is the node-injected current vector, requires only accumulative instructions of z. Such an operation can also be performed by a thread for an electrical element with atomic functions.
25728 VOLUME 6, 2018 FIGURE 5. Thread structure of electrical system kernel.
The thread structure is shown in Fig. 5 . For calculating branch currents and Norton equivalent currents, p threads perform FMA operations in the SIMT pattern, and these are one of the most efficient operations on the GPU. Then, each thread adds its result z i (Norton equivalent current) to the corresponding y ∈ y (node-injected current) in shared memory using atomic functions. Following synchronization that ensures all node-injected currents are available, the nodal voltage equations can be solved to obtain the system status of the given time step.
2) ILLUSTRATIVE CASE: THREAD-ORIENTED TRANSFORMATION FOR A SINGLE-PHASE TRANSFORMER
As an example, the proposed computational model transformation was applied to a basic single-phase transformer, the conventional computational model of which is shown in (7) .
where i 1 
Then, the computation model of the basic transformer can also be illustrated as the equivalent circuit shown in Fig. 6 . It is evident that M in (8) is the node-branch incident matrix of the derived circuit, which makes the proposed model transformation physically meaningful. By carefully choosing a matrix for model transformations, electrical components such as machines and transmission lines can be described as compositions of primitive electrical elements in Table 1 . Then, they can be processed by grouped homogeneous threads on the GPU.
3) CODE AUTOMATION TOOL FOR ELECTRICAL SYSTEM KERNEL
Benefiting from the normalized transformation of all electrical components, a unified electrical system kernel can be formed automatically to update the status of all electrical components and solve the nodal voltage equations for the entire electrical system. The GPU code for the electrical system kernel is generated according to Algorithm 1.
From steps 2 to 7, the normalized transformation of all electrical components are completed. As a result, each component C i is represented by p i primitive elements with Norton equivalent conductance G bi , and coefficients A i and B i to calculate historical current. Correspondingly, the computations of C i are processed by p i concurrent threads in the SIMT pattern as in Fig. 5 . The parameters of all primitive elements are saved in three arrays G b , A, and B in global memory on the GPU.
Then, in step 8, the GPU kernel code to process 3) Update Norton equivalent currents I ne,b with parameters A and B. In particular, the output of control system I c is assigned to the corresponding control sources as their I ne,b directly. 4) Once I ne,b is updated, node-injected currents I n are assembled by atomic functions (atomicAdd) on shared memory, after which _syncthreads() is invoked to synchronize the previous thread-based operations.
Finally, in step 9, the code for solving nodal voltage equations is assembled into the GPU kernel after the thread synchronization command. In this work, the nodal voltage equations required for the EMTP-type program are solved using flexible approaches according to system scale. By choosing proper models of the electrical components, the nodal conductance matrix can be kept constant when the topology of the electrical system is consistent. Thus, for small-scale networks, the inverse of the nodal conductance matrix can be calculated and stored before the simulation, and only efficient dense matrix-vector productions (gemv) are performed in each step of the integration to obtain the nodal voltages. However, for larger systems, massive areas of area and amounts of computation are required for such dense matrices, which reduces efficiency. An alternative is to use the sparse direct solver with the right-looking LU factorization method [23] for such large-scale power systems. In this approach, sparse factorization can be processed before the simulation, and only sparse forward/backward substitutions are performed in each step of the integration to obtain the nodal voltages.
B. TRANSIENT SIMULATION OF CONTROL SYSTEM 1) LDAG TRANSFORMATION FOR LAYERED CONTROL TOPOLOGY
To implement the control system simulation on the GPU efficiently, the tasks of the threads should be carefully defined and scheduled to avoid conflicts, and to optimize the threads and shared memory utilization. In this study, after being decoupled and ordered, computational instructions related to control functions are linked by a layered topology. Grouped threads then execute instructions within the same layer in SIMT pattern.
(i) Layered transformation of control system As in primitive electrical elements, a primitive control instruction is defined, the computations of which are challenging or unnecessary to decompose further. In the case study used in this paper, primitive control instructions included comparators, PI controllers, and basic math functions.
In general, a control system can be modeled by a directed graph as shown in Fig. 7 , where V (node) stands for primitive control instructions and E (edge) represents control signal flows. To avoid algebraic iterations in particular, algebraic feedback loops are unlocked by inserting a time-step delay. Thus, a control system is described as a DAG. Consequently, primitive control instructions need to be calculated in the correct sequence. The corresponding threads for each instruction are thus causally related, and need to be scheduled carefully to enhance DOP. The topology of the control system is layered to determine the workflow of the grouped threads.
The longest path-searching algorithm proposed in [24] , [25] is used to generate layers for the control system DAG, which discovers the layer set
Then, threads for instructions in the same layer can be launched simultaneously without data exchanges among them. Following synchronization to guarantee the availability of all inputs, unoccupied threads pick up instructions for the next layer and process them concurrently.
(ii) Grouped threads working in SIMT pattern For a large power system, control system simulations on a GPU need to handle a large number of controls for different electrical components in one time step. These variant control systems can be uniformly described by a layered DAG, where primitive control instructions are considered nodes. According to their types and content, control instructions in layer L i can be divided into n i groups. Then, threads processing instructions in a group share the same operations and work in the SIMT pattern. By allocating grouped threads to different core arrays, cross-group parallel computing can be achieved, which makes the control system simulation fully parallelized on the GPU. The workflow of the grouped threads is illustrated in Fig. 8 . As shown in Fig. 8, synchronization is required between the layers to make sure all inputs for the next layer are ready. Grouped threads are reusable to compute all layers, which can reduce the number of threads used in a block. Moreover, shared memory is used to cache data transferred between layers. As the number of inputs of each layer is much smaller than their summation, excessive use of shared memory is effectively avoided. Moreover, shared memory allocated to a block can be reused for data caching between layers, which makes optimal use of computational resources and enhances the scalability of control system simulations on GPUs.
It should be noted that grouped threads working for different control systems can process the same instructions in SIMT pattern, which increases the use of shared memory. If the shared memory is not sufficient in size for data exchanges required by the control system simulation, the global memory can be used to enable communication between layers. This supplemental solution eliminates the scale limit of the control systems simulated on the GPU.
2) ILLUSTRATIVE CASE: LDAG TRANSFORMATION FOR A V-Q CONTROLLER
In the left part of Fig. 9 , part of the basic V-Q controller for power converters is illustrated in a block diagram containing six types of primitive control instructions. After transference to a DAG, the instructions associated are redistributed to five layers, which can be processed by grouped threads. Only four threads are involved in such solutions. In layer 1, two groups of threads work together for two types of instructions. Then, two threads are reused to handle instructions in the following layers, which contain only one type of instruction in each layer.
3) CODE AUTOMATION TOOL FOR THE CONTROL SYSTEM KERNEL
As with the electrical system, an automatic GPU code generator is designed to form the control system simulation kernel according to the LDAG computational model.
The code generation procedure is as shown in Algorithm 2.
In step 2, the control system simulation model is transformed into several DAG-based computational models. Steps 3 and 4 further decompose the DAG into independent subgraphs and cluster them into n groups according to structure. In this way, different DAG groups (G i ) can be processed in different thread blocks to achieve coarse-grained parallelism.
Algorithm 2 Code Generation for Control System Kernel
1: Initialize an empty thread grid with 0 threads (Th = 0); 2: Control system model → DAG computational model G; 3: Split G into m independent DAGs; 4: Form n DAG groups G i , (i = 1, · · · , n) by DAG isomorphism judgment; 5: Set n thread blocks for the control system kernel. 6: while i < n do 7: Set blockIdx = i − 1; 8: Layer the elements of G i into h i layers; 9: while j < h do 10: Group elements in layer L i,j into k i,j groups; 11: Generate k i,j SIMT groups for layer L i,j ; 12: Generate synchronization codes for layer L i,j ; 13: j + 1 → j; 14: end while 15: i + 1 → i; 16: end while 17: Generate synchronization codes for all threads; 18: Compile Moreover, DAGs inside a group are isomorphic, and thus can benefit from the fine-grained SIMT parallelism.
From steps 5 to 16, the GPU kernel code to process Fig. 8 is automatically generated by interpreting the DAG computational model according to the following steps: 1) Create n thread blocks to handle n DAG groups. 2) Allocate a shared memory area for each thread block to cache all node results of each DAG group G i . 3) Layer the elements (control instructions) in a DAG group G i into h i layers. Then, inside each thread block, classify the control instructions and process them on different SIMT thread groups from layer 1 to h i . It should be noted that the number of threads for each SIMT group should be specified according to the computing resource allocation principle in [20] to obtain optimal efficiency. 4) Synchronization is required between layers. Once the synchronization for layer h i is complete, outputs of the control system are copied back to global memory I c , which updates the Norton equivalent currents of the controlled sources.
Finally, in step 17, to ensure that the final results are ready, code for synchronizing all thread blocks is assembled into the control system kernel. The results can be used to solve electrical systems in the electrical system kernel.
Once codes for the electrical system and the control system kernel have been created, the NVCC compiler [10] is called to compile the final GPU program to form the binary execution file.
4) MERITS OF THE AUTO-GENERATED GPU CODES
In Fig. 10 , two kernel structures are provided for comparison. Compared with the loop-based unified kernel in [20] ( Fig. 10(a) ), the automatically generated kernel in this paper VOLUME 6, 2018 ( Fig. 10(b) ) has the following three advantages:
• The operation for each thread in (Fig. 10(b) ) is fixed, where addressing and memory access for instructions are eliminated.
• The previously proposed kernel needs to maintain a general framework to make it compatible with all primitive instructions within all layers. As a result, only a part of the threads can be used for each layer, as additional thread groups should be reserved for other primitive instructions. On the contrary, all threads and more computing resources are available in each layer in (Fig. 10(b) ), which enhances the efficiency and the scalability of the simulation kernel.
• The automatically generated kernel unrolls the while loop in (Fig. 10(a) ) by writing layered data into a binary execution file, which reduces the number of conditional statements and instruction scheduling time. Such a space-time trade-off optimizes execution speed.
IV. TESTS AND DISCUSSION
A. TESTING ENVIRONMENT
The results of the performance of the thread-oriented EMTP-type simulator on a GPU (denoted by P4) were compared with those using PSCAD/EMTDC (a commercial EMTP-type simulator) and other EMTP-type programs denoted by P1 P3.
1) EMTP-type program on a multi-core CPU (P1) that uses the Intel Math Kernel Library (MKL) [26] to solve nodal voltage equations; 2) Hybrid CPU-GPU EMTP-type program (P2), which only solves nodal voltage equations on GPU; and 3) Fine-grained EMTP-type program (P3) fully implemented on a GPU with loop-based control system kernel [20] . The computing devices (CPU, GPU 1 and GPU 2) used in the above programs are provided in Table 2 .
B. TEST CASES
In the following tests, the IEEE 33-bus distribution network was selected as the basic case. Constant RLC loads were connected through three-phase transformers. Large-scale cases are formed by connecting basic cases as in Fig. 11 .
Moreover, PV cells were connected to different buses through inverters and a three-phase transformer, as shown in Fig. 12 . Without loss of generality, a PV cell was modeled as a source of controlled current while the grid side inverter operated in Q-V control mode. The reference voltage on the DC side was determined by MPPT control based on a perturbation and observation method proposed in [27] . To maintain the system's conductance matrix constant, an average value model (AVM) of the converter was adopted in this work, the details of which can be found in [28] .
After transforming models of the electrical and control systems into thread-oriented form, the computational scales of the test cases can be estimated as in Table 3 (N e is the number of basic cases and N c the number of PV cells).
C. ACCURACY VALIDATION
We set N e = N c = 1 and connected PV cells to bus 17. Transients of test system were simulated with the following conditions. The results in Fig. 13 reveal consistency between the commercial EMTP-type program and the parallel EMTPtype program, which verifies the correctness of the proposed thread-oriented modeling as well as the accuracy of parallel simulations using GPU kernels.
D. EFFICIENCY TESTS
To test the efficiency of the thread-oriented parallel simulation on a GPU, detailed time costs of the electrical and control system simulations were measured separately by the NVIDIA Profiler. Similar time costs were measured for P1 P3. Then, the acceleration performance of the proposed method (P4) was evaluated thoroughly.
1) TIME COSTS OF ELECTRICAL SYSTEM SIMULATION
We set N c = 0 and let N e increase from one to 30. The electrical systems were scaled up from 195 single-phase nodes with 33 three-phase transformers to 5,763 single-phase nodes with 990 three-phase transformers.
The maximum time cost per integration step for the EMT simulation was measured for the test systems using P1~P4, and was used to solve the nodal voltage equations (T1) and form node-injecting currents of the electrical components (T2). When N e < 10, dense matrix-vector multiplication was used to calculate node voltage; otherwise, sparse backward/forward substitutions were applied. As all factorizations were had been completed and stored before the simulation started, factorization time was not included in T1. Comparisons of time costs per time step between simulation programs are given in Table 4 .
As the electrical system was scaled up, the acceleration ratios of P2, P3, and P4 continued to grow compared those in P1. The CPU-GPU program (P2) showed very limited speedups as the linear algebraic solvers on the GPU did not exhibit superiority for such small system scales. However, the full GPU-based programs (P3 and P4) accelerated the simulations for all test cases.
Moreover, compared with the parallel EMTP-type program (P3) proposed in [20] , the newly designed threadoriented EMTP-type simulator (P4) gained more speedups for all test cases owing to the saved time in T2. Such speedups benefit from transforming the electrical system into combinations of primitive electrical elements. As shown in Table 4 , program P4 created more threads for electrical component simulations than P3. Moreover, as only one SIMT group was required, P4 had a higher DOP than P3 when calculating electrical components (T2 period). By contrast, in P3, different electrical components were processed in different SIMT groups, the number of instructions of which varied. It is evident as thread homogeneity for electrical components increased, computing resources could be evenly utilized, which remarkably enhanced the performance of the parallel EMT simulation.
Furthermore, two types of GPU devices were used in efficiency tests in this paper, i.e., the K20x (Kepler architecture) and P100 (Pascal architecture). As new architecture was enabled and increased bandwidth, frequency, and the number of cores, the proposed kernels performed better on the P100 platform according to the test results. As shown in Table IV , all time costs recorded on the P100 were smaller than the corresponding ones on the K20x for the electrical system simulation. During the network solution process (T1), the P100 platform showed a smaller speedup (1.1x) than the K20x platform for large-scale systems, as the DOP for the traditional direct solver was relatively poor. For the welldesigned electrical system kernel used to calculate the electrical components, the P100 platform recorded a speedup of more than 1.8x compared with the K20x.
2) TIME COSTS OF CONTROL SYSTEM SIMULATION
We created test cases where N e = 1 and let N c increase from one to 300. For the case with the largest number of PV cells, 23,400 (78 x 300) primitive control instructions needed to be processed in each time step. As in tests for the electrical systems, the maximum time cost of the control system simulation per step was measured for P1 P4 both on the CPU and the GPU, as shown in Table 5 . Table 5 shows that for cases with large-scale control systems, P3 and P4 on the GPU gained more significant accelerations than those using the P1 and P2 on a CPU.
Moreover, P4 achieved an almost 1.5x to 2x speedup compared with P3. As shown in Section III-B-3), by using the code automation tool, the while loop in P3 was unrolled, and excessive addressing and memory accesses were eliminated. Moreover, as observed in Table 5 , 15 concurrent thread groups per block (GPB) were used for each PV control system in P4. However, only nine concurrent thread groups were available in P3 because the kernel in it was forced to use an excessive number of threads to reserve other primitive instructions within all layers. Therefore, the computing resources for each layer were restricted, which limited the performance enhancement of control system simulation on the GPU. However, for P4, the above restrictions were eliminated, which rendered the control system kernel more efficient.
Moreover, for the control system simulation, the P100 platform (P4 -GPU2) was almost 2x faster than the K20x platform (P4 -GPU1). As higher clock frequency and memory bandwidth were enabled in the P100, the computation and communication processes between layers in the control system simulation became more efficient. Moreover, more SMs with larger shared memory were integrated, so that the device could handle more parallel independent control systems than the K20x platform. The above advantages make the P100 device feasible in real-time simulations, as discussed in 4.5.
E. SCALABILITY AND REAL-TIME PERFORMANCE 1) LIMITS OF SCALABILITY
As shown in Sections 4.3 and 4.4, the proposed threadoriented EMTP-type program on a GPU is feasible for simulating large-scale electrical and control systems.
For electrical system simulations, the available simulation scale was limited only by the size of the global memory, most of which was used to store the dense nodal resistance matrix Z of a small system or the sparse LU factorization results of a large-scale system.
For the control system, the maximum number of threads in one block limited the scale of the system simulated. As computations in a layer should be processed in a thread block, for a control system with more than 1,024 primitive control elements in each layer, the proposed algorithm will fail. One possible solution is to split the large LDAG into small graphs and distribute computations to multiple thread blocks.
2) TOTAL SPEEDUPS AND FEASIBILITY OF REAL-TIME SIMULATION
Real-time simulation is an important tool in studies on power systems. A series of typical test cases used in real-time simulations were built by choosing N e = 1 and N c ∈ [1, 300]. The total time costs of each integration step and speedups are given in Fig. 14, with comparisons with the performance of P1.
In addition clear acceleration on the K20x and P100, the real-time performance of the proposed algorithm can be viewed in Fig. 14 . In general, the integration step used in the real-time simulations should have been shorter than 200µs to guarantee accuracy and numerical stability. The real-time simulations using integration steps of 100µs and 50µs were chosen as benchmarks.
As observed in Fig. 14, for simulations with an integration step of 50µs, real-time simulations were implemented on the P100 with no more than 50 PV systems. If 100µs was used as integration step, real-time simulation for 100 PV systems could still be guaranteed. 
V. CONCLUSION
In this paper, a full GPU-based EMTP-type simulator was designed using thread-oriented modeling, and an algorithm for both electrical and control systems as well as code automation tools. According to accuracy and efficiency tests, the proposed simulator showed significant speedups for large-scale power system simulations compared with past research. Moreover, the latest NVIDIA P100 device with Pascal architecture was shown to be feasible in real-time simulation applications, such as closed-loop control system validation.
However, in practical EMT applications, the overall performance of the proposed simulator might still be restricted and should be improved further with the following two concerns.
• As only one GPU card has been used during the simulation, the speed-ups is mainly achieved by fine-grained parallel strategies on one GPU. However, for large-scale cases, higher efficiency may be obtained with the help of network partitioning technique and coarse-grained parallelism using multiple GPUs. Therefore, more attention needs to be devoted to parallel EMTP-type simulators on multi-GPU based platforms.
• The test environment in this paper is an ideal one, as only one simulation task occupies all computing resources during each test. Therefore, speed-ups might have been restricted when launching multiple simulation tasks concurrently. In future research, GPU-based parallel-batched simulations for multiple tasks should also be investigated to build advanced simulation applications.
