ABSTRACT
1. INTRODUCTION
Matrix multiplication is widely used in many areas such as graph theory, residue-level protein folding [4], numerical algorithms, digital image processing, and others. Multiplying very large matrices requires substantial computation time, since the sequential matrix multiplication algorithm has a time complexity of O(n^3), where n is the dimension of the matrix. Because applications demand ever higher computational throughput, many parallel algorithms based on the sequential algorithm have been developed to improve the performance of matrix multiplication. Numerous improvements [7, 8] have been made to sequential algorithms to meet these demands, but their performance remains limited. For that reason, parallel approaches have been examined and enhanced for decades.
Common parallel matrix multiplication algorithms decompose the matrices according to the number of processors available in the interconnection network [9, 10]. Each algorithm works on matrices decomposed into sub-matrices (blocks). During the multiplication, each processor computes a partial result using the sub-matrices currently assigned to it. When the multiplication is completed, the coordinator processor assembles the partial results into the complete product matrix.
Interconnection networks are the core of a parallel processing system: they are the fabric through which the system's processors are linked. Because the network topology plays a major role in the performance of a parallel system, several interconnection network topologies have been proposed for this purpose, such as the tree, hypercube, mesh, ring, and Hex-Cell (HC) [1, 2, 5, 6, 11, 12, 14, 15, 18].
Among the wide variety of interconnection network structures proposed for parallel computing systems, the Hex-Cell network has received much attention due to the attractive properties inherent in its topology [1, 16, 17].
The proposed parallel matrix multiplication on the Hex-Cell network is implemented using the Message Passing Interface (MPI) library, where MPI processes are assigned to cores. If each MPI process is assigned to its own core, the computation is parallel; if more than one MPI process is assigned to the same core, the computation is concurrent. The proposed algorithm was evaluated on the IMAN1 supercomputer, Jordan's first supercomputer, which is available for use by academia and industry in Jordan and the region.
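As a minimal illustration of this MPI setup (not the authors' actual implementation), the sketch below assumes that rank 0 plays the role of the main coordinator (MC) while every other rank acts as a Hex-Cell node; the node-side work is only indicated by a comment.

#include <mpi.h>
#include <stdio.h>

/* Minimal role-assignment sketch: rank 0 is assumed to act as the
 * main coordinator (MC); every other rank acts as a Hex-Cell node. */
int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        printf("MC running with %d MPI processes\n", size);
    /* else: this rank would execute the node-side distribution,
     * multiplication, and combining steps described in Section 3. */

    MPI_Finalize();
    return 0;
}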
The rest of the paper is organized as follows. Section 2 describes the Hex-Cell network. Section 3 presents the proposed algorithm. Section 4 provides an analytical evaluation. Section 5 presents the performance results, and Section 6 summarizes and concludes the paper.

2. HEX-CELL NETWORK

The Hex-Cell network is built from hexagonal cells arranged in levels:

• Level 1 is the innermost level, corresponding to one hexagon cell.
• Level 2 corresponds to the six hexagon cells surrounding the hexagon at level 1.
• Level 3 corresponds to the 12 hexagon cells surrounding the six hexagons at level 2.
The levels of a Hex-Cell network with depth d are labeled from 1 to d. Each level i has N_i nodes, representing processing elements interconnected in a ring structure [1].
Figure 1. Hex-Cell networks HC(1), HC(2), and HC(3) at levels one, two, and three [1].
The address of each node in the Hex-Cell topology is identified by (S, L, Y), where S denotes the section number, L denotes the level number, and Y denotes the node number on that level, labeled from Y_1, ..., Y_n, where n = (2×L) − 1 [1].
A node with the address 1.1.1 is the first node at section number 1 and level number 1, and a node with the address 6.1.1 is the first node at section number 6 and level number 1, as shown in Figure 2.
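This addressing scheme can be illustrated with a short sketch that enumerates all (S, L, Y) addresses of a Hex-Cell of a given depth, assuming, per the relation quoted above from [1], that each of the six sections has 2L − 1 nodes at level L; the helper name enumerate_addresses is hypothetical.

#include <stdio.h>

/* Enumerate Hex-Cell node addresses (S, L, Y) for a network of depth d,
 * assuming each of the six sections has 2*L - 1 nodes at level L. */
static void enumerate_addresses(int depth)
{
    int total = 0;
    for (int s = 1; s <= 6; s++)
        for (int l = 1; l <= depth; l++)
            for (int y = 1; y <= 2 * l - 1; y++) {
                printf("%d.%d.%d\n", s, l, y);
                total++;
            }
    printf("total nodes: %d\n", total);  /* equals 6*d*d for depth d */
}

int main(void)
{
    enumerate_addresses(2);              /* HC(2): prints 24 addresses */
    return 0;
}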
3. PMMHC ALGORITHM
In this section, we propose a new Parallel Matrix Multiplication algorithm on the Hex-Cell Network (PMMHC), as shown in Figure 4. The aim of parallelizing matrix multiplication is to make the algorithm run faster and more efficiently than the sequential one for very large matrices. The approach partitions matrices of size n into a set of partitions; each partition is assigned to a separate processor, which multiplies its part using the sequential matrix multiplication algorithm. Thus, the number of partitions depends on the number of available processors.
In this paper, we apply matrix multiplication to the Hex-Cell interconnection network topology. The Hex-Cell network [1] is divided into six sections, as shown in Figure 2. The proposed algorithm uses each section as a ring topology, and the root nodes of level 0 rely on a one-to-all personalized broadcast to reach their child nodes. As shown in Figure 4, the proposed work assumes that the matrix data is stored in the main coordinator processor (MC), where it is partitioned, multiplied, and finally combined. The L0-HC nodes are the level-0 ring nodes of the Hex-Cell network; the L1-ring coordinators (L1-RCs) are the root nodes of each ring section, each corresponding to one of the level-0 nodes.

Part of the Figure 4 listing (distribution steps):
   Wait until the coordinator that received the data sends an acknowledgment message.
6. Send a message to the MC informing it that the process is completed.
7. The MC stops the distribution process and announces the beginning of the next step.
   L1-Ring Distribution of Data
8. For all ring coordinators L1-RCs, do the following in parallel:
9. Blocks of matrix A are partitioned into a number of horizontal stripes.
10. Blocks of matrix B are presented as a set of vertical stripes.

The parallel matrix multiplication on the Hex-Cell interconnection network in Figure 4 is illustrated in more detail as follows:
Phase 1: Data Distribution Phase. Assume an I×K matrix A and a K×J matrix B, with the whole matrices A_ik and B_kj stored on the MC (main coordinator). The distribution phase is composed of three steps, as follows (see Figure 4):
• Data Distribution in the Main Ring (Lines 1-4 in Figure 4). The MC starts by decomposing the initial matrices A and B. We assume all matrices are square of size n×n, the number of vertical blocks equals the number of horizontal blocks (both equal to q), and every block has size v×v, where v = n/q (a sketch of this block decomposition appears after this list).
• Global Distribution of Data (Lines 5-7 in Figure 4). The nodes (in parallel) start sending the partitions through the optical links. As shown in Figure 5, the nodes {N_0, N_1, N_2, N_3, N_4, N_5} send their partitions to their directly connected neighbors in rings {R_0, R_1, R_2, R_3, R_4, R_5}, respectively. Note that each node in the main group (L_0-Ring) receives an acknowledgement message from its neighbor in the other ring once the transfer is completed. Each node in the main ring (L_0-Ring) then sends a message to the MC indicating that the process is completed. When the MC has received messages from all the processors that participated in the global distribution step, it announces the beginning of the next step, which is the ring distribution.
• Ring Distribution of Data (Lines 8-12 in Figure 3). In this step, each L_1-RC splits its blocks of matrix A into a number of horizontal stripes, and matrix B is presented as a set of vertical stripes. The stripe size should be equal to v = n/p (assuming that n is divisible by p), which makes it possible to distribute the computational load equally among the processors.
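As a small illustration of the block decomposition used in the distribution phase (v = n/q, with n assumed divisible by q), the following C sketch copies one v×v block out of an n×n row-major matrix; extract_block is a hypothetical helper, not part of the PMMHC implementation. On the Hex-Cell, such blocks (or the stripes built from them) would then be shipped to the rings as described above.

#include <stdio.h>
#include <stdlib.h>

/* Copy block (bi, bj) of size v x v out of an n x n row-major matrix,
 * where v = n / q and n is assumed to be divisible by q. */
static void extract_block(const double *m, int n, int v,
                          int bi, int bj, double *block)
{
    for (int r = 0; r < v; r++)
        for (int c = 0; c < v; c++)
            block[r * v + c] = m[(bi * v + r) * n + (bj * v + c)];
}

int main(void)
{
    int n = 6, q = 3, v = n / q;          /* 3 x 3 blocks of size 2 x 2 */
    double *a   = malloc((size_t)n * n * sizeof *a);
    double *blk = malloc((size_t)v * v * sizeof *blk);
    for (int i = 0; i < n * n; i++) a[i] = i;

    extract_block(a, n, v, 1, 2, blk);    /* block in row 1, column 2 */
    for (int i = 0; i < v * v; i++) printf("%.0f ", blk[i]);
    printf("\n");

    free(a); free(blk);
    return 0;
}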
Phase 2: Multiplication Phase (Lines 13-14 in Figure 3).
• All the elementary processors (nodes) in the interconnection network apply sequential matrix multiplication, multiplying stripes of A by stripes of B. Each processor computes its part of the product to produce a block of rows of C, as follows:
For i from 1 to n:
    For j from 1 to p:
        Let sum = 0
        For k from 1 to m:
            sum = sum + A[i][k] × B[k][j]
        C[i][j] = sum

Phase 3: Combining Phase. The combining phase is parallelized by reversing the order of the steps in the distribution phase, as follows:
• Level 1-Ring Data Combining (Lines 15-16 in Figure 3). The aim of this step is to combine all the multiplication results of the Hex-Cell sections via the electronic links. This is done by first collecting the partial products from the elementary processors of the L_1-Ring of each section of the Hex-Cell network and storing the combined partitions in the RC of each section.
• Global Data Combining (Lines 17-18 in Figure 3). All RCs in all sections of the interconnection network send their chunks of multiplication data via the optical links to their corresponding processors in the main ring (L_0-Ring).
• Combining Data in the Main Ring (Line 19 in Figure 3). The MC collects the complete multiplication result by combining all partitions into one matrix, C.
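For illustration only, the sketch below approximates the combining phase with a single flat MPI_Gather in which every rank owns v = n/p contiguous rows of C; the actual PMMHC algorithm instead combines hierarchically through the L1-rings and the main ring as described above.

#include <mpi.h>
#include <stdlib.h>

/* Simplified combining sketch: every rank holds v = n/p contiguous rows
 * of the result C and the MC (rank 0) gathers them in rank order.
 * Assumes p divides n. */
int main(int argc, char **argv)
{
    int rank, p, n = 8;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int v = n / p;                          /* rows owned by this rank */
    double *c_rows = calloc((size_t)v * n, sizeof *c_rows);
    double *c_full = (rank == 0) ? malloc((size_t)n * n * sizeof *c_full)
                                 : NULL;

    /* ... local stripe-by-stripe multiplication would fill c_rows ... */

    MPI_Gather(c_rows, v * n, MPI_DOUBLE,
               c_full, v * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(c_rows);
    if (rank == 0) free(c_full);
    MPI_Finalize();
    return 0;
}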
4. ANALYTICAL EVALUATION
This section provides the analytical evaluation of the proposed PMMHC parallel matrix multiplication on the Hex-Cell interconnection network. Three performance metrics are used to evaluate the algorithm: run-time complexity, speedup, and efficiency.
Run time complexity
The time complexity of the distribution phase in PMMHC is the same as the complexity of a one-to-all personalized broadcast in the L_0-Ring and L_1-Ring, with a different chunk size for each processor, and the time complexity of the combining phase in PMMHC is the same as the complexity of an all-to-one personalized gather over the L_1-Ring and L_0-Ring. So, the total communication time of matrix multiplication on the Hex-Cell network is:
For the computation time complexity, each processor multiplies its elements using the sequential matrix multiplication algorithm. So, the total time complexity of the proposed algorithm is:
Speedup
Speedup is one of the performance metrics used in the evaluation of parallel algorithms. It evaluates the performance of a parallel algorithm in comparison with its sequential counterpart [3]. The speedup of PMMHC is shown in Equation 1.
Speedup = T_sequential / T_parallel    (1)
Efficiency
Efficiency is another performance metric that is widely used to assess the performance improvement of parallel algorithms. Its value indicates how well the processors are utilized, and it is defined as the speedup divided by the number of processors [3]. The efficiency of PMMHC is shown in Equation 2.
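As a hypothetical illustration of these two metrics (the figures below are not taken from the experiments of Section 5): if the sequential multiplication takes T_sequential = 120 s and the parallel PMMHC run on p = 24 processors takes T_parallel = 10 s, then Speedup = 120/10 = 12 and Efficiency = Speedup/p = 12/24 = 0.5, i.e., the processors are on average 50% utilized.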
5. EXPERIMENTAL RESULTS AND PERFORMANCE EVALUATION
In this section, the results of different simulation runs over different data distributions are presented. Table 1 shows the speedup results for different datasets; in general, the speedup is better when large matrices are multiplied. The IMAN1 Zaina cluster is used to conduct our experiments, and the Open MPI library is used in our implementation of the parallel matrix multiplication algorithm. The experiments run on dual quad-core Intel Xeon CPUs with SMP and 16 GB RAM, under Scientific Linux 6.4 with Open MPI 1.5.4 and C/C++ compilers. Table 1 also shows architectural information about the Hex-Cell interconnection network, as well as the expected size of the input data that can be assigned to each group in a lucky-case partitioning when applying the parallel matrix multiplication on the Hex-Cell interconnection network. Figure 5 shows the speedup of the proposed algorithm for different matrix sizes, with all results obtained over different numbers of processors. As the data size increases, the run time increases due to the larger number of multiplications and the additional time required for data combining. The size of the data assigned to each processor plays a primary role in obtaining the highest speedup values; this means that the ratio between the data size and the number of processors can be taken as an indicator of whether a high speedup value can be obtained.
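The run times behind such speedup figures are typically measured with MPI's wall-clock timer; the following sketch (not the authors' measurement harness) times a hypothetical placeholder pmmhc_multiply() on every rank and reports the maximum elapsed time, which is the value normally used as the parallel run time.

#include <mpi.h>
#include <stdio.h>

/* Minimal timing sketch: measure the elapsed wall-clock time of the
 * parallel section on every rank and report the maximum at rank 0. */
void pmmhc_multiply(void) { /* placeholder for the parallel work */ }

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);            /* start all ranks together */
    double t0 = MPI_Wtime();
    pmmhc_multiply();
    double local = MPI_Wtime() - t0;

    double t_parallel;
    MPI_Reduce(&local, &t_parallel, 1, MPI_DOUBLE, MPI_MAX, 0,
               MPI_COMM_WORLD);
    if (rank == 0)
        printf("parallel time = %f s\n", t_parallel);

    MPI_Finalize();
    return 0;
}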
6. CONCLUSIONS
In this paper, we presented a parallel matrix multiplication algorithm for the Hex-Cell interconnection network. The proposed algorithm was simulated over different numbers of processors and different matrix sizes, and it comprises three phases applied on the Hex-Cell interconnection network: the distribution phase, the multiplication phase using sequential matrix multiplication, and finally the combining phase. These phases can easily be adapted to other applications that manipulate massive amounts of data. The parallel matrix multiplication on the Hex-Cell interconnection network shows higher performance than its sequential counterpart on a single processor.
As part of our future work, we aim to conduct a comparative study by applying matrix multiplication over different interconnection networks. We also aim to extend this study by applying sorting algorithms, such as merge sort and quick sort, on the Hex-Cell interconnection network.
