I. INTRODUCTION
Modular multilevel converters (MMCs) are an important topology being explored in various fields, including high-voltage direct current (HVdc) transmission systems, high-power drives, and solid-state transformers [1], [2]. They consist of N submodules (SMs) connected in series in each arm to produce the required voltages and currents. Their modularity and their ability to produce very low harmonics make them attractive in these applications.
This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-publicaccess-plan).
An MMC has at least 18N+6 states that need to be simulated, counting the switching signals, capacitor voltages, and arm currents. The control implementation adds further states. For an HVdc application, N is of the order of several hundred to meet the dc-link voltage requirements, which results in several thousand states. Moreover, the switching elements introduce numerical stiffness in the differential algebraic equations (DAEs) that represent the dynamics of the MMC states [3]. The switching and sampling times are small (of the order of microseconds), so very small time-steps are needed to simulate MMCs. Together, these factors lead to long simulation times for MMCs in HVdc applications. Equivalently, it becomes difficult to achieve real-time simulation of MMCs, in which the computation of each simulation time-step has to be completed within the wall-clock duration of that time-step.
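As a rough illustration of the state count quoted above, the lower bound 18N+6 can be evaluated directly. The value N = 400 below is only an illustrative choice within the "several hundred" range, and the function name is hypothetical.

```python
def mmc_min_states(n_sm_per_arm):
    """Lower bound on simulated MMC states, 18N + 6, as quoted in the text."""
    return 18 * n_sm_per_arm + 6


# Illustrative value only: N = 400 SMs per arm (an assumption, not a quoted figure).
n_states = mmc_min_states(400)
print(n_states)
```

Control states come on top of this bound, so the actual count depends on the control implementation.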
Several real-time simulation platforms have been proposed for fast and hardware-in-the-loop simulation of MMCs [4]-[7]. They are either based on expensive field-programmable gate arrays (FPGAs) or, when implemented in central processing units (CPUs), have not been able to simulate the high number of SMs present in HVdc applications. The limitation of the CPU-based implementations arises from the large number of floating-point operations required at very small time-steps (of the order of microseconds).
In this paper, a high-performance computing (HPC) real-time simulation algorithm on a low-cost digital signal processor (DSP) is proposed to simulate the MMC dynamics and control states in real-time. The proposed real-time simulation algorithm overcomes the limitations mentioned above for CPU-based implementations. A comparison of the proposed real-time MMC HPC solution with existing real-time MMC simulators is provided based on a quality measure that is defined in this paper. The quality measure is an indication of the quality of the real-time simulation algorithm. The proposed algorithm is critical for any HPC-based simulation and control of large-scale power electronics like MMCs and for hardware-in-the-loop evaluations.
II. MMC-HVDC SIMULATION ALGORITHM & CONTROL
The circuit diagram of a three-phase MMC is shown in Fig. 1. It consists of 6 arms, each consisting of N series-connected SMs and an inductor. The basics of operation of the MMC are explained in detail in [1] and are not repeated here.
The dynamics of the arm currents and SM capacitor voltages of the MMC of Fig. 1 are given by (1) and (2), where v_sm,y,i,j is the output voltage of SM i in arm y of phase j [8]. They form a set of semi-explicit DAEs that represent the overall dynamics of the MMC. Numerical stiffness is observed in (1), so it is discretized using an algorithm with the stiff-decay property, such as backward Euler. As no stiffness is observed in (2), it is discretized using a non-stiff algorithm such as forward Euler. A more detailed description of this hybrid discretization algorithm, which is based on the numerical stiffness associated with the DAEs, can be found in [8].
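The hybrid discretization can be sketched on a deliberately simplified model. The two equations below are assumptions made for illustration and are not equations (1) and (2) of the paper: a stiff arm-current equation L di/dt = v - R i, and a non-stiff capacitor equation C dv/dt = i when the SM is inserted. All function names and parameter values are hypothetical.

```python
def backward_euler_current(i_prev, v_drive, r, l, h):
    """Implicit step: solve L*(i_next - i_prev)/h = v_drive - r*i_next for i_next."""
    return (i_prev + (h / l) * v_drive) / (1.0 + h * r / l)


def forward_euler_current(i_prev, v_drive, r, l, h):
    """Explicit step on the same equation, shown only to expose the instability."""
    return i_prev + (h / l) * (v_drive - r * i_prev)


def forward_euler_capacitor(v_prev, i_arm, inserted, c, h):
    """Explicit step for C*dv/dt = i_arm when the SM is inserted, else dv/dt = 0."""
    return v_prev + (h / c) * (i_arm if inserted else 0.0)


# Hypothetical stiff case: h*R/L = 50, far outside the forward Euler stability limit.
H, R, L, V = 5e-6, 1e4, 1e-3, 100.0
i_be = i_fe = 0.0
for _ in range(10):
    i_be = backward_euler_current(i_be, V, R, L, H)  # decays towards V/R
    i_fe = forward_euler_current(i_fe, V, R, L, H)   # oscillates and grows

# Non-stiff capacitor update with hypothetical values (10 mF, 100 A arm current).
v_inserted = forward_euler_capacitor(1600.0, 100.0, True, 10e-3, H)
v_bypassed = forward_euler_capacitor(1600.0, 100.0, False, 10e-3, H)
```

In this toy case the backward Euler current settles at V/R while the forward Euler current diverges, which is the behaviour the stiff-decay property is meant to avoid. The explicit capacitor update stays well behaved because its dynamics are slow relative to the time-step.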
The control algorithms in the MMC have been explained in [3] and are summarized in Figs. 2-3. The arm current control strategy is shown in Fig. 2 and the SM capacitor voltage balancing algorithm is shown in Fig. 3.
The overall implementation of the MMC-HVdc simulation and control algorithms is summarized in Fig. 4. In the figure, v_c refers to the vector consisting of all the SM capacitor voltages, z_n refers to the vector of all arm voltages, r_arm refers to the vector consisting of R_y,j, i refers to the vector containing all the arm currents, N_on refers to the vector of
III. HIGH-PERFORMANCE COMPUTING HARDWARE
The high-performance computing (HPC) hardware chosen for real-time simulation is the Texas Instruments (TI) C6678. The chosen hardware is an HPC embedded system based on an 8-core floating-point digital signal processor (DSP). It can provide a maximum performance of 16 giga floating-point operations per second (GFLOPS) per core when the processor is running at 1.0 GHz [9], resulting in a maximum of 128 GFLOPS. Each core contains two identical data paths and a total of eight functional units. The functional units in each data path are the arithmetic logic unit (.L), control (.S), multiplier (.M), and data (.D) units, which perform general arithmetic and logic instructions, memory access and address calculations, and multiplication operations.
Each core has a 32 KB L1P first-level program memory, a 32 KB L1D first-level data memory, and a 512 KB L2 second-level memory. Additionally, there is a 4 MB shared memory (called multi-core shared memory, MSM) and an 8 GB double data rate type 3 (DDR3) memory that are accessible to all the cores. There are high-speed peripherals like the two-lane PCIe Gen2, four-lane SRIO 2.1, and HyperLink that are capable of supporting up to 50 Gbaud data transmission. The architecture of the HPC hardware is summarized in Fig. 5.
IV. REAL-TIME MMC-HVDC SIMULATION ALGORITHM
A real-time simulation algorithm, based on the application of HPC techniques to the MMC simulation algorithms described in Section II, is developed in the HPC hardware. The HPC techniques are key to ensuring that the hard real-time constraints are satisfied.
A. HPC-Based Algorithm
The SM capacitor voltage system and the arm current system, as shown in Fig. 4 , need to be simulated in series. The series simulation requirement arises from the data exchange that happens at each time-step. The SM capacitor voltage system requires the arm current information and the arm current system requires the SM capacitor voltage information. Similarly, the arm current control and the SM capacitor voltage balancing strategies need to be simulated in series. However, the SM capacitor voltage system and arm current control can be simulated in parallel. The same can be said about the SM capacitor voltage balancing and the arm current system. Within each time-step, there are two parts. In the first part, cores 1 to 6 of the HPC hardware are assigned to simulate the SM capacitor voltage system of the six arms of the MMC. Core 7 executes the arm current control during the same time. In the second part, the SM capacitor voltage balancing algorithm is executed in cores 1 to 6, and the arm current system is simulated in core 7.
The parallel executions have to be synchronized with the serial executions to maintain data integrity before the execution of the second part. The OpenMP barrier writes back and invalidates all the shared variables. Data synchronization based on the OpenMP barrier takes too long (3.5 μs), which reduces the number of SMs in an MMC that can be simulated in real-time. The long time arises from extremely conservative algorithms that avoid data race conditions in the shared memory while updating the data from each core. In this implementation, a new barrier algorithm is developed that writes back and invalidates only the necessary data and enables fast synchronization. Upon completion of the first part, cores 1 to 6 write back z_n and r_arm and invalidate i, while core 7 writes back N_on and invalidates z_n and r_arm. The write-back and invalidate mechanisms thus avoid race conditions that may lead to inaccurate data updates in the memory. Similarly, upon completion of the second part, cores 1 to 6 invalidate N_on and core 7 writes back i. After the completion of each part and the corresponding write-back/invalidate, the cores wait to synchronize.
The algorithm to synchronize multiple cores during parallel executions is based on Lamport's Bakery algorithm [10] and enables fast synchronization. It uses two 8-byte character buffers, named buffer-1 and buffer-2, in a non-cacheable shared memory (like the DDR3 memory). Each core sets its element in buffer-1 upon completion of a synchronization. Once the tasks in a part are completed in a core, that core sets its element in buffer-2 and clears its element in buffer-1. The cores then wait until all the elements in buffer-1 are cleared. Once buffer-1 is cleared by all the cores, each core clears its element in buffer-2. The cores then wait until all the elements in buffer-2 are cleared, at which point the synchronization process is completed.
The real-time simulation algorithm is summarized in Fig. 6 .
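The two-part time-step with a barrier after each part can be emulated with threads. The sketch below is not the two-buffer handshake described above: it swaps in a simpler per-core round counter (one shared slot per core, each written only by its owner) so that the barrier is safely reusable in a short example. Python threads stand in for DSP cores, a plain list stands in for the non-cacheable shared buffer, and all names are hypothetical.

```python
import threading
import time

N_CORES = 7  # cores 1-6 for the six arms, core 7 for the arm-current tasks


class RoundCounterBarrier:
    """Spin barrier with one shared slot per core (swapped-in variant)."""

    def __init__(self, n_cores):
        self.arrived = [0] * n_cores  # arrived[k] = barriers reached by core k

    def wait(self, core):
        self.arrived[core] += 1            # only core k ever writes slot k
        target = self.arrived[core]
        while min(self.arrived) < target:  # spin until all cores reach this round
            time.sleep(0)                  # yield so the other threads can run


def run_steps(n_steps=20):
    barrier = RoundCounterBarrier(N_CORES)
    shared = [-1] * N_CORES  # one result per core: written in part 1, read in part 2
    errors = []

    def core_task(core):
        for step in range(n_steps):
            shared[core] = step                 # part 1: produce this core's data
            barrier.wait(core)                  # all part-1 data is now in place
            if any(v != step for v in shared):  # part 2: consume every core's data
                errors.append((core, step))
            barrier.wait(core)                  # no overwrite until all have read

    threads = [threading.Thread(target=core_task, args=(k,), daemon=True)
               for k in range(N_CORES)]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=30)
    finished = all(not t.is_alive() for t in threads)
    return finished, errors, barrier.arrived


finished, errors, arrived = run_steps()
```

Because each core writes only its own slot and waits for every slot to reach its own round, a fast core that has already entered the next barrier cannot stall a slow core that is still leaving the previous one. On the DSP, the selective write-back and invalidate operations would be issued just before each wait.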
B. Memory Allocation
The data and program distribution in the memory to implement the real-time simulation algorithm is shown in Fig. 7. For example, the capacitor voltages are placed in the local L1D memory of cores 1 to 6 to reduce the memory access time in every simulation time-step. The capacitor voltages are used in the SM capacitor voltage system and SM capacitor voltage balancing functions. Access to the capacitor voltages through the local L1D memory in cores 1 to 6 maximizes the computational performance. A similar argument can be extended to the arm currents of three consecutive time-steps, the switching signals, and the control internal variables. Based on the L1D memory availability in the C6678 and assuming a 16 KB L1 cache, data for up to 800 SMs/arm can be stored. The cache is required to allow temporary mapping of data in the higher memory levels (like L2 and MSM) that are used during the simulation. The rest of the data needs to be shared and is placed in the MSM, as it is used by both the arm current system or control and the SM capacitor voltage system or control. The SM capacitor voltage balancing and reinitialization functions are placed in the L2 memory of cores 1 to 6 as they are accessed frequently. The arm current control, SM capacitor voltage system, and arm current system are placed in the MSM due to the size of the corresponding programs.
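The L1D budget quoted above can be turned into a per-SM figure. This is a back-of-the-envelope sketch: it assumes that the whole non-cache part of L1D is available for SM data and that variables are stored as 4-byte words, neither of which is stated in the text. The names are hypothetical.

```python
L1D_BYTES = 32 * 1024       # L1D size per core, from the hardware description
L1_CACHE_BYTES = 16 * 1024  # part of L1D assumed to be configured as cache
SMS_PER_ARM = 800           # maximum number of SMs per arm quoted in the text
WORD_BYTES = 4              # assumption: single-precision (4-byte) storage


def l1d_bytes_per_sm(l1d_bytes, cache_bytes, sms_per_arm):
    """Bytes of directly addressable L1D available for each SM."""
    return (l1d_bytes - cache_bytes) / sms_per_arm


bytes_per_sm = l1d_bytes_per_sm(L1D_BYTES, L1_CACHE_BYTES, SMS_PER_ARM)
words_per_sm = int(bytes_per_sm // WORD_BYTES)
print(bytes_per_sm, words_per_sm)
```

Under these assumptions each SM has roughly 20 bytes, or five 4-byte words, of local L1D storage. The text does not say how that space is divided among the capacitor voltages, switching signals, and other variables.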
C. Other Implementation Optimization
The ability to attain a very high GFLOPS per core is critical to achieving real-time simulation. Each of the functions described above is, hence, optimized using the following techniques to reach close to the 16 GFLOPS maximum performance provided by each core of the HPC hardware:
• Eliminate loop-carried dependencies: Loop unrolling can be increased by reducing loop-carried dependencies, using the restrict keyword on loop variables to indicate that the input and output variables are independent.
• Loop unrolling: There are two register files (one per data path) in each DSP core. Each holds four functional units, .L, .M, .S, and .D, which together can provide efficient pipelining of up to 8 parallel instructions each cycle. Better resource utilization and pipelining can be achieved by balancing the units in the two data paths, and this balance determines the number of times the loops are unrolled. These techniques have been applied to the reinitialization, SM capacitor voltage balancing, and SM capacitor voltage system programs, which have been unrolled by 4x, 4x, and 2x, respectively.
• Removal of dependencies and data alignment: The use of the DATA_ALIGN pragma along with _nassert() on the loop variables enables the use of wider load and aligned memory access instructions.
• Use of intrinsics: Intrinsics have been used to calculate non-linear functions like the square-root (_rsqrsp), reciprocals (_rcpdp, _rcpsp), absolute value (_fabs), and double to integer conversions (_dpint). The intrinsics provide accuracy up to the eighth binary position in the mantissa. For higher accuracy, Newton-Raphson iteration is used.
• Optimize non-linear implementations like the square-root function, division, and trigonometric operations based on the accuracy requirements. The accuracy requirements determine the number of Newton-Raphson iterations required. One Newton-Raphson iteration can improve the accuracy in the mantissa up to the sixteenth position. The accuracy requirements also determine whether single-precision or double-precision floats are required.
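The effect of Newton-Raphson refinement on a coarse reciprocal seed can be sketched as follows. The function coarse_reciprocal is a hypothetical stand-in for a hardware reciprocal estimate: it computes 1/a and rounds the mantissa so that the relative error is at most 2^-8. It is not the DSP intrinsic, and it assumes a > 0.

```python
import math


def coarse_reciprocal(a, bits=8):
    """Stand-in seed: 1/a with its mantissa rounded to a relative error <= 2**-bits."""
    mantissa, exponent = math.frexp(1.0 / a)  # mantissa lies in [0.5, 1)
    scale = 2 ** bits
    return math.ldexp(round(mantissa * scale) / scale, exponent)


def refine_reciprocal(a, x):
    """One Newton-Raphson step for f(x) = 1/x - a; the relative error is squared."""
    return x * (2.0 - a * x)


def relative_error(a, x):
    """Relative error of x as an approximation of 1/a."""
    return abs(a * x - 1.0)


a = 3.0
x0 = coarse_reciprocal(a)      # about 8 accurate bits
x1 = refine_reciprocal(a, x0)  # about 16 accurate bits
x2 = refine_reciprocal(a, x1)  # about 32 accurate bits
```

Each step roughly doubles the number of accurate bits, so the accuracy requirement fixes how many refinement steps a given function needs.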
V. CASE STUDIES
In this section, a real-world MMC system is considered to validate the accuracy of the developed real-time simulation platform against the reference MMC model developed in [8]. A 401-level MMC study system is considered, based on the France-Spain MMC-HVDC interconnection described in [11]. Only one MMC is considered for validation purposes. The dc-link is assumed to be a dc source and the ac side is assumed to be a 3-phase ac source.
A. Validation of Simulation Algorithm
Two case studies are considered to compare the results from the real-time simulation platform with the reference MMC model: (i) steady-state, and (ii) step-change in current. The arm currents from the proposed real-time simulation are shown in Fig. 8. The reference simulation results are also shown in the figure. The proposed real-time simulation algorithm produces less than 1% error with respect to the reference results.
B. Discussions
A comparison with real-time simulators that use CPUs and FPGAs [4]-[7] is presented in this section. Unlike [4]-[7], the implementation in this paper does not use a multi-rate simulation of the MMC states. This enables the harmonics in the voltages and currents to be captured accurately, which is significant when connecting MMCs to weak grids. The harmonic limits have been defined for different grid strengths up to the 50th harmonic in [12], which would require small time-steps of the order of a few microseconds.
The real-time implementation proposed here can simulate 425 SMs/arm/core with a time-step of 5 μs in real-time. The time-step of 5 μs is chosen to capture every change in the SM status. The best CPU-based real-time simulation of an MMC system among the real-time implementations in [4]-[7] can simulate 230 SMs/arm/core with a time-step of 16.5 μs. The following quality measure is used to compare the proposed real-time HIL solution with the existing solutions: Q = (Number of SMs per arm that can be simulated in real-time)/(Simulation time-step)/(GFLOPS measure of the system). This ratio indicates the quality of the simulation algorithm and of its real-time implementation in the hardware. The quality measure is 15 times higher with the proposed algorithm and real-time implementation than in the best existing case. The comparison is made against CPU-based simulators because the costs of Intel and AMD CPUs and the TI C6678 are similar. A comparison with FPGA-based real-time simulators is not considered due to the high costs associated with FPGAs. For example, the cost of the TI C6678 is 1/4 that of a Xilinx Virtex-6 FPGA and 1/8-1/10 that of a Xilinx Virtex-7 FPGA.
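The quality measure can be written as a small function. Only the SM counts and time-steps are quoted above. The GFLOPS figure of the best CPU-based simulator is not given in this section, so the sketch sets the GFLOPS term to the same placeholder value for both systems and evaluates only the SMs-per-time-step part of the ratio. The function name is hypothetical.

```python
def quality(sms_per_arm, time_step_us, gflops):
    """Q = (SMs per arm simulated in real-time) / (time-step) / (GFLOPS measure)."""
    return sms_per_arm / time_step_us / gflops


# Placeholder GFLOPS = 1.0 for both systems: the CPU figure is not stated here.
q_proposed = quality(425, 5.0, gflops=1.0)
q_best_cpu = quality(230, 16.5, gflops=1.0)
ratio_without_gflops = q_proposed / q_best_cpu
print(round(ratio_without_gflops, 2))
```

With equal GFLOPS terms the ratio is about 6.1, so the remainder of the reported factor of 15 comes from the difference in the GFLOPS measures of the two systems.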
The proposed real-time simulation of MMCs will be configured in a power electronic hardware-in-the-loop (PE-HIL) configuration to evaluate novel power electronic SM architectures and advanced cooling materials in SMs. The PE-HIL concept using the proposed real-time simulation method is shown in Fig. 9.
VI. CONCLUSIONS
A real-time simulation algorithm in a low-cost HPC DSP is described in this paper. Based on the defined quality measure, it shows an improvement by a factor of 15 with respect to the best existing real-time CPU-based simulator. The results from the real-time simulation show errors of less than 1% with respect to the reference results.
