Abstract-This paper presents an implementation of the Jacobi power flow algorithm to be run on a single instruction multiple data (SIMD) unit processor. The purpose is to be able to solve a large number of power flows in parallel as quickly as possible. This well-known algorithm was modified taking into account the characteristics of the SIMD architecture. The results show a significant speed-up of the algorithm compared to the time required to solve the algorithm in a conventional CPU, even when a more efficient sequential algorithm, such as the NewtonRaphson, is used. The accuracy of the performance has been validated with the results of the IEEE-118 standard network.
I. INTRODUCTION
Today, a new architecture based on graphic processor units (GPU) exists that provides high performance parallel computing. At first, the GPUs were designed to improve computer graphics, but, little by little, in only a few years, their performance improved, primarily due to the needs of the video game industry. As a result, today, very realistic graphics are generated due to the powerful GPU units available. In only a few years' time, these units have begun to be used for general purpose calculations, yielding important results in Electromagnetism [1] , and Biology [2] . The number of fields GPUs can be employed in continues to grow. In the electrical power engineering field there are some problems that can be solved by GPUs, and some papers have already been published regarding this. Indeed, recently, a single-instructionmultiple-thread (SIMD) algorithm was presented to solve a transient stability problem [3] but with no improvement over already existing commercial software. A real-time power flow SIMD algorithm using symmetrical components is explained in [4] . Another approach for accelerating a power flow was published in [5] , where Gauss-Jacobi and Newton-Raphson algorithms are used, with a speed-up of 10 times in a largescale system. A Newton method and a biconjugate gradient algorithm is described in [6] . This author parallelizes one power flow obtaining an improvement of 2.1 times.
The previously cited papers show the difficulties in parallelizing the equations of the load flow and obtaining a significant improvement using the GPUs' architecture. Additionally, in [6] Amdahl´s law is used to establish an acceleration limit in the load flow parallelization.
Power load flow is one of the most common and important analyses in the electrical power system [7] . It consists of obtaining bus voltages, line powers and generated power in electric power plants. Decisions about design, planning and operation are made using these data; these are the main cornerstones for the proper functioning of the network.
Nowadays, a single load flow can be solved relatively quickly, but in power system analysis there are more complicated problems that include solving a large number of power flows. Among these problems are steady state security assessment [8] , reliability evaluation, stochastic power flow by Montecarlo simulations or planning optimization.
This paper presents an implementation of a Jacobi algorithm for solving power flows on a massive scale, which would allow solving the previously mentioned problems more quickly.
II. GPU ARCHITECTURE
The GPUs have two kinds of programmable processors or arithmetic logic units (ALU): vertex and fragment [3] . Vertex processors work with meshes, doing mathematical operations and defining the position of the elements in the screen. The fragment processors, on the other hand, define the colors, light and textures.
Performing general-purpose computations with a GPU is difficult, hard work using a graphics-only language such as OpenGL or DirectX, because graphics instructions are being used to solve mathematical problems. As a result of the need for easier programming in graphics processors, high level languages HAL and CUDA were developed by the two main GPU builders, AMD and NVIDIA respectively. In this paper we will use CUDA.
CUDA stands for Computer Unified Device Architecture [9] - [11] . This technology allows the user to interact with the graphical processor in an easier fashion, even if the user is not an experienced programmer. This architecture uses all the ALUs to perform the general-purpose computations. Moreover, CUDA complies with the IEEE requirements for single precision floating-point arithmetic.
In CUDA language, an algorithm is a kernel that is run by the graphical unit. Several kernels can be executed at the same 978-1-4244-8782-0/11/$26.00 ©2011 IEEE time. Each kernel is executed by a collection of threads which are gathered in blocks Fig. 1 . All these blocks make up a grid. All the threads execute the same code. In other words, a single instruction is run by multiple threads; this is an SIMD. There is a mechanism to identify each thread inside the block and each block inside the grid, using indexes. These indexes are commonly used to read and write in the memory.
From the programmer's perspective, the kernel is called from the host. When the device has finished the calculations, the execution of the instructions returns to the host. The user must be careful to properly copy the data from the host to the device and vice versa, and also set up the size of the grid and the blocks in accordance to the objective of the problem. Moreover, CUDA provides specific memories that can improve the execution time if they are used properly, because each one is designed for a specific access pattern.
III. SIMD-BASED JACOBI POWER FLOW
In this section, a short explanation of Jacobi´s method, its characteristics, and the implementation of CUDA in the GPU, in order to calculate a large number of power flows, is presented.
A. Jacobi´s algorithm
Jacobi´s method is an iterative solver suitable for calculating a large linear equation system. There are improvements of this method that have a faster convergence, such as the one by Gauss-Seidel. Nevertheless, Jacobi's iterative solver has an advantage over the others in that it is very suitable for parallelizing, as is shown in [12] - [15] .
A linear set of equations can be represented as:
where A is the coefficient matrix, x is the variable vector and b is the vector of the independent terms. Each element of the vector x is calculated in the following manner:
where x k j is an element of the vector x in the iteration k, a ij is an element of the matrix A, and b i is an element of the vector b.
As can be seen in Eq. 2 it is not necessary to know the value of the remaining variables in the iteration k + 1 to be able to calculate x k+1 i For this reason, it is very parallelizable.
B. Massive implementation
As per what was explained in the previous subsection, the calculations in the Jacobi's iterative method do not need to be kept in any sort of synchronization during iteration. Nevertheless, it is necessary to:
• To synchronize between each iteration.
• To access the vector x k−1 by each x k j . These restrictions are very important to be able to decide how the kernel must be programmed to obtain the highest performance.
From the power flow perspective, the algorithm had to obtain the voltage of each electrical node. As is well known, there are three kinds of nodes. There are nodes with a defined consumption of active and reactive power and with an unknown voltage (PQ node). Nevertheless, in the generation nodes for instance, the modulus of the voltage and the active power (PV node) is known. And lastly, a slack bus is set so the generated power and the consumption of power are equal. These nodes require a slight change in the algorithm.
The input data are: a vector with the node type (T ), the specified voltages for nodes PV (U esp ), the active (P ) and reactive (Q) power vector for both nodes PQ and PV, and the admittance matrix. The latter is made up of the diagonal elements stored in vector (Y D ) and a compact matrix (Y R ) to store the remainder elements, using another matrix (I R ) to store the column indexes for the elements. The results (the voltages) of the iteration k + 1 are stored in vector (U k+1 ). Finally, another vector is defined in order to store the voltage values of the iteration k.
To obtain the best performance, each block of threads will solve a load flow. A copy of the admittance matrix and vectors, the specified voltage for the PV nodes, and the active and reactive power for PQ nodes is transferred to the shared memory.
In Alg. 1 the structure of the code programmed in the graphic unit is presented. In line 2, a shared boolean variable that indicates when the solution is reached, is defined. When the solution is reached the block becomes inactive. In line 3 the algorithm makes a copy of the input data in the shared memory. The remaining code is a common implementation of a Jacobi's iterative method, except for the synchronized flags.
The first synchronization, in line 6, ensures that all the threads reach this point and no thread checks the break condition of the while loop. The last synchronization does not permit a thread to begin another iteration before the current one has finished. Synchronize, synchronize all the treads of the block 7: error ← false; 8: if T j = P Q Node then 9:
else if T j = P V Node then 11:
12:
end if 14: if
error ← true; 16: end if
17:
k ← k + 1;
18:
Synchronize, synchronize all the treads of the block 19: end while
In this paper the previously mentioned algorithm is run on an NVidia Tesla C2050. This is a device designed specifically to process general purposed calculations [16] . It allows a maximum setting of 48 kB of shared memory for each block and each block can contain 1024 threads. In the proposed method the number of nodes is limited. In fact, the shared memory runs out before the maximum number of threads per block is used. Nonetheless, this paper addresses the idea of solving a high number of power flows on a graphics unit.
IV. EXPERIMENTAL RESULTS
In order to make a realistic comparison, the results presented in [5] are used. The authors make a comparison between their proposal, a single power flow solved on graphic unit, and a CPU implementation on the Intel Math Kernel Library (MKL).
The analysis consists of a series of executions where a number of load flows are solved at the same time. The network used is the IEEE 118, which is calculated several times in each case. In order to calculate the total CPU execution time, it has been taken into consideration that solving N power flows need N times more time. The convergence criterion is that the error defined in Alg. 1 is lesser than 10 −8 . All the calculations are done using single precision floating-point format.
The benchmark is run on an NVidia Tesla C2050 with 3 GB of memory connected to a PCI Express graphics bus in 16x mode and an Intel i5 750 at 2.67 GHz with 8 GB of RAM. We consider that this hardware is a representative configuration of the current state of the art in the computing calculation based on GPU.
As can be seen in Fig. 2 , the CPU needs much more time. It takes almost 25 minutes to solve 65000 power flows. By contrast, the GPU takes only 18.6 seconds. The improvement is very high, as is shown in Fig. 3 , and the time reduction is very close to 80 times. With a stressed network, near the collapsed point, the proposed method requires more iterations, over to 4 times further. Even so the speed-up keeps on being big, because the Newton Raphson needs more iterations too (near 2 times).
When a large number of blocks are thrown in CUDA, the performance is no faster than in the beginning, Fig. 3 . But in the case of 118 nodes, only 118 threads are used, which is not the number that provides the best performance [9] .
Another important point is that the CPU reference times were obtained using a much better algorithm. For instance, a Newton-Raphson method probably finds a solution in 5 iterations, but with the Jacobi's method 728 iterations are needed.
V. CONCLUSION
Previously published references focus on a parallel implementation in order to solve a single power flow. The speed-ups obtained are in the range of 1 to 2 times, some of them even report a performance of almost 10 times in large-scale systems. These speed-up figures are not amazing and are limited by the size of the sequential part of the algorithms used. Moreover, the transfer data time between the host (the computer) and the device (GPU) is significant when compared to the overall execution time of a single power flow.
The results obtained in this paper suggest it would be better to focus on power system analysis problems where a large number of power flow executions need to be done, including problems such as reliability assessment, probabilistic load flows or steady state security analysis.
Despite using a very basic iterative method with a very poor convergence rate, a speed-up of 80 times is achieved when compared to an efficient multi-core CPU algorithm. Even when this implementation has a limit on the number of nodes, the obtained results are promising. In conclusion, the authors think that specific algorithms which take advantage of the simultaneous solving of multiple power flows should be developed in the future.
ACKNOWLEDGMENT
This work was supported by the Galician autonomous government under the contract INCITE08PXIB303262PR.
