In this paper, a hybrid parallel implementation of inverse matrix computation using SMW formula is proposed. By aggregating the memory bandwidth in the hybrid parallel implementation, the bottleneck due to the memory bandwidth limitation in the authors previous multicore implementation has been dissolved. More than 8 times of speed up is also achieved with dual-core 8-nodes implementation which leads more than 20 simulation steps per second, or near real-time performance.
Introduction
In this paper, we have focused the linear equation solver for Ax = b which appears in many simulation codes. In particular, our interest is in the interactive simulation environment where a part of the matrix A and right hand side vector b are gradually changed due to the progress in time or the real-time interaction by the operator or the user of the running simulation. Figure 1 shows a demonstrative scenario of interactive simulation where we have to solve series of linear equations with matrices A i (i = 0, 1, . . .) such that A i ∼ A i+1 .
Background
In general, it is not recommended to compute the inverse matrix of A to solve a linear equation Ax = b even if A is known to be a sparse matrix [2] . However, if the inverse matrix of an matrix approximate to A is known and there exists any scheme to efficiently compute A −1 by means of the compensation to the known inverse matrix, computing the inverse matrix of A to solve the linear equation for multiple right hand side vectors b i could be an attractive strategy.
The Sherman-Morrison-Woodbury formula [1] (SMW formula, in short), represented as follows, can be used for this purpose.
where A, B, and C are the matrix of size n × n, n × s, and s × n, respectively, and A is non singular. leads the ( A ) −1 . Figure 2 represents the size of matrices appeared in the second term of the right hand side of Eq. (2).
If we define a matrix A as
Though the computational complexity of the computation for
, when n s and s can be thought as a fixed value, total computational complexity of the Eq. (2) becomes O(n 2 ).
Since most of the computational steps in Eq. (2) shown below ( Fig. 3) is simple matrix-matrix multiplication, one can expect to benefit from parallel processing to accelerate the computation. Furthermore, since Δ A c and E c are sparse matrices, one can effectively utilize the sparsity of matrices during the computation of Eq. (2), regardless of whether A is dense or sparse.
In our previous work, the thread parallel execution on a quad core CPU with sparse matrix computation library [4] led us 4.6 times of speedup [3] . However, in steps 7 and 8 which are essentially embarrassingly parallel parts, we could not obtain sufficient parallelism and we observed that these two steps spent 95% of total execution time.
In order to achieve further speedup for real time simulation, we have investigated the reason why steps 7 and 8
Copyright c 2012 The Institute of Electronics, Information and Communication Engineers become the bottleneck. And we guess it is due to the insufficient memory bandwidth of the quad core system used in our previous experiment. Therefore, in this paper, we are going to examine the effect of available memory bandwidth to the performance of the computation of linear equation solver using SMW formula.
Hybrid Parallel Implementation of Inverse Matrix Computation by SMW Formula
In order to increase the available memory bandwidth, we have considered the hybrid parallel implementation of inverse matrix computation. As we have discussed before, all the computational steps considered above can be implemented in either intra-node and/or inter-node parallel execution fashion. The inter-node parallel execution of these steps can utilize aggregated memory bandwidth from all nodes, while it incurs inter-node communication to exchange the intermediate computation results to proceed the computation step. After considering this tradeoff between available memory bandwidth and amount of communication, we have implemented the step 7 to 9 as inter-node parallel execution using MPI while the step 1 through 6 are redundantly computed in every node to reduce the inter-node communication. And the computation steps in each node is further parallelized using OpenMP with the exception for step 4. In this hybrid parallel implementation, inter-node communication is necessary only when 1) the right hand side vector b is scattered among the nodes, 2) the result vector x (= S 9 ) is gathered from the nodes and 3) information (size of O(s 2 )) to re-construct Δ A c and E c is sent, once A 0 −1 has been distributed among the nodes at the initialization step.
In the following discussion, we are going to explain the implementation results. Experimental environment is a PC cluster of 8 nodes interconnected by a Gigabit Ethernet switch. Each node consists of CPU: Core2Duo 2.66 [GHz] with L2 Cache of 4 [MB] and 2 [GB] of Memory, operating on OS: Linux 2.6.19-1.2895.fc6 with Intel compiler icc 12.0 (with -O3 option and MKL10.0.3.020). The size of matrix A is 3759 (n = 3759) and its non zero elements are 1.00% of a whole elements. The size of s is fixed to 32 which is similar to the number of non zero elements in each column of the original matrix A. Figure 4 shows the total execution time of steps 1 through 9. By utilizing the aggregated memory bandwidth, 6.1 times of speedup is achieved for single-core 8-nodes execution as compared to the singe-core single-node execution. Roughly 1.3 times of speedup is further obtained by the dual-core execution on each nodes. Combining these effects, 8.1 times of total speed up becomes available as the result of our hybrid parallel implementation.
Detailed analysis of the implementation results shows that 8.0 times and 7.8 times of speed up is observed for steps 7 and 8, respectively, and the bottleneck in the previous work [3] has been dissolved. We also confirmed that the inter-node communication costs is 3 msec in case of 8nodes implementation which is negligible compered to the reduction in computation time.
Conclusion
In this paper, we have proposed a hybrid parallel implementation of inverse matrix computation using SMW formula. By aggrigating the memory bandwidth in our hybrid parallel implementation, the bottleneck in our previous multicore implementation has been dissolved. More than 8 times of speed up is also achieved with dual-core 8-nodes implementation which leads more than 20 simulation steps per second, or near real-time performance.
