This paper considers the performance of interconnection networks in single-core and multi-core cluster architectures. Several parameters and system configurations are considered to optimise the performance of a single-core cluster as a baseline experiment. These results are then extended to multi-core cluster architectures. The performance measurements cover communication within and between clusters with varying numbers of processor cores, providing a comparative analysis of single-core and multi-core cluster architectures. Analytical calculation validates the simulation results.
I. INTRODUCTION
Processor performance is often associated with high processor clock frequencies and increasing power dissipation. Every new performance enhancement in processors leads to greater performance demands. As this demand continues and single-core processors reach their physical limits of complexity and speed, the move towards multi-core processors has begun. A multi-core processor is a single processor containing two or more complete computational engines (cores), which enhances performance, reduces power consumption and permits simultaneous processing of multiple tasks [1]. Multi-core architectures have been a major trend over the past decade and allow faster execution of applications by taking advantage of parallelism.
With the emergence of high-speed networks, High Performance Computing (HPC) has adopted network-based computing clusters as cost-effective platforms to achieve high performance, which has led to the trend towards cluster systems with multi-core processors [2]. Multi-core clusters allow for faster execution of applications by taking advantage of the ability to work on multiple cores simultaneously. However, applications on multi-core clusters have not achieved optimal performance and scalability compared to symmetric multiprocessing (SMP) and non-uniform memory access (NUMA) clusters. Several performance models of cluster systems have been proposed in [3] [4] [5]; however, the evaluations are confined to clusters of single-core processors. In order to take advantage of multi-core processors in a cluster system, it is important to have an in-depth understanding of the characteristics of multi-core clusters and their impact on application performance and behaviour.
Many studies [6] [7] [8] have been carried out to improve the performance of multi-core clusters, but few clearly distinguish the key issue of the performance of interconnection networks. Interconnection networks are critical in achieving high performance in clusters [9] . The existing models are unable to evaluate the potential communication performance of the interconnection networks within an implementation of a multi-core cluster architecture.
The contribution of this paper is to investigate the interconnection network performance of a multi-core cluster architecture, based on a multi-cluster environment, in harnessing the power of multi-core clusters. The experiment is based on wormhole flow control, which has increased in popularity in cluster systems due to its low buffering requirements [10]. It works by dividing packets into a sequence of fixed-size units called flits, with channels and buffers allocated to flits. When a flit cannot acquire a buffer, blocking may occur. Blocking increases the delay in the interconnection network and degrades the performance of the cluster.
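The division of a packet into flits can be illustrated with a minimal Python sketch. The `Flit` type, the head/body/tail labels and the zero padding of the final flit are illustrative assumptions, not details taken from the simulation model used in this paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Flit:
    packet_id: int
    index: int
    kind: str       # "head", "body" or "tail" (labels are an assumption)
    payload: bytes


def split_into_flits(packet_id: int, data: bytes, flit_size: int) -> List[Flit]:
    """Split a packet into fixed-size flits, padding the last one with zeros."""
    if flit_size <= 0:
        raise ValueError("flit_size must be positive")
    if not data:
        raise ValueError("packet must carry at least one byte")
    chunks = [data[i:i + flit_size] for i in range(0, len(data), flit_size)]
    flits = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            kind = "head"          # a single-flit packet is labelled "head"
        elif i == len(chunks) - 1:
            kind = "tail"
        else:
            kind = "body"
        # Pad so that every flit has exactly flit_size bytes.
        flits.append(Flit(packet_id, i, kind, chunk.ljust(flit_size, b"\x00")))
    return flits


if __name__ == "__main__":
    # Hypothetical 1000-byte packet split into 256-byte flits.
    for flit in split_into_flits(packet_id=1, data=bytes(1000), flit_size=256):
        print(flit.index, flit.kind, len(flit.payload))
```

In a wormhole network the head flit reserves the channel and the remaining flits follow it, so a head flit that cannot acquire a buffer stalls every flit behind it.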
II. THE NEW ARCHITECTURE
A multi-core cluster is a cluster where all the nodes in the cluster have multi-core processors. In addition, each node may have multiple processors (each of which contains multiple cores). With such cluster nodes, the processors in the node share both memory and their connections to the outside. A new architecture known as the Multi-Core Multi-Cluster Architecture (MCMCA) is introduced in Figure 1. The structure of MCMCA is derived from a Multi-Stage Clustering System (MSCS) [11], which in turn is based on a basic cluster using single-core nodes.
The MCMCA is built up of a number of clusters, where each cluster is composed of a number of nodes. Each node of a cluster has a number of processors, each with two or more cores. Cores on the same chip share the local memory, and the cluster nodes are connected through the interconnection network. There are five communication networks in MCMCA. Two of them are commonly found in any multi-core cluster architecture: the intra-chip network (AC) and the inter-chip network (EC). The new communication networks introduced in this paper are the intra-cluster network (ACN), the inter-cluster network (ECN) and the multi-cluster network (MCN).
Communication between two processor cores on the same chip uses the intra-chip network (AC). The AC network acts as a connector between two or more processor cores on the same chip and divides messages among those cores. Dividing the messages among a number of cores, in theory, results in more than twice the performance with lower communication delay [12]. An inter-chip network (EC) provides communication between processors on different chips within the same node. Messages travelling to a different chip in the same node use the intra-chip (AC) and inter-chip (EC) networks to reach their destination. The intra-cluster network (ACN) is an interconnection network that connects nodes within a cluster. Messages that cross to other nodes in the same cluster travel over the ACN via the intra-chip (AC) and inter-chip (EC) networks to complete their journeys.
The longest route for a message involves the ECN and MCN. Messages travelling between clusters communicate via two interconnection networks. An inter-cluster network (ECN) is used to transmit messages between clusters, and the clusters are connected to each other via the multi-cluster network (MCN). When messages reach the other cluster, they travel through the ECN of the target cluster before arriving at their destination. The same process continues for the other clusters until all the packets exit the network.
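The routing rules described above can be summarised as a small classification function. The sketch below is a hedged Python illustration: the `CoreAddress` tuple, the function name and the order in which the networks are listed are assumptions made for this example, and only the networks named in the text are returned for each case.

```python
from typing import List, NamedTuple


class CoreAddress(NamedTuple):
    """Hypothetical address of a core inside the MCMCA hierarchy."""
    cluster: int
    node: int
    chip: int
    core: int


def networks_traversed(src: CoreAddress, dst: CoreAddress) -> List[str]:
    """Return the communication networks a message uses, per the text above."""
    if src == dst:
        return []                      # no communication needed
    if src.cluster != dst.cluster:
        # Source-side ECN, then the MCN, then the ECN of the target cluster.
        return ["ECN", "MCN", "ECN"]
    if src.node != dst.node:
        # Same cluster, different nodes: ACN via AC and EC.
        return ["AC", "EC", "ACN"]
    if src.chip != dst.chip:
        # Same node, different chips.
        return ["AC", "EC"]
    # Same chip, different cores.
    return ["AC"]


if __name__ == "__main__":
    a = CoreAddress(cluster=0, node=0, chip=0, core=0)
    print(networks_traversed(a, CoreAddress(0, 0, 0, 1)))  # ['AC']
    print(networks_traversed(a, CoreAddress(0, 0, 1, 0)))  # ['AC', 'EC']
    print(networks_traversed(a, CoreAddress(0, 3, 0, 0)))  # ['AC', 'EC', 'ACN']
    print(networks_traversed(a, CoreAddress(2, 0, 0, 0)))  # ['ECN', 'MCN', 'ECN']
```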
III. SIMULATION
Simulation models of the Multi-Core Multi-Cluster Architecture have been developed using the OMNeT++ network simulation software. OMNeT++ is an open-source discrete event simulation tool that can be used in the design and analysis of systems in which state changes are discrete [13]. The model is built at run time to form a topology that represents the geometric structure and the communication links between the modules. The simulation can be run with different inputs and model parameters, such as the number of cores per node, the number of clusters, the number of messages to be generated, the message length (M) and the inter-arrival time.
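The parameters listed above can be grouped into one configuration record. The following Python sketch is not OMNeT++ code and does not reproduce the authors' model: the `SimulationConfig` class, its field names and the exponential inter-arrival distribution are assumptions chosen only to show how such parameters could drive message generation.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SimulationConfig:
    """Hypothetical container for the model parameters named in the text."""
    cores_per_node: int
    num_clusters: int
    num_messages: int
    message_length_bytes: int      # M in the text
    mean_inter_arrival_s: float


def arrival_times(cfg: SimulationConfig, seed: int = 0) -> List[float]:
    """Generate message arrival timestamps.

    Assumption: inter-arrival times are exponentially distributed
    (a Poisson arrival process); the text only names the inter-arrival
    time as a parameter.
    """
    rng = random.Random(seed)
    now = 0.0
    times = []
    for _ in range(cfg.num_messages):
        # expovariate takes the rate, i.e. 1 / mean inter-arrival time.
        now += rng.expovariate(1.0 / cfg.mean_inter_arrival_s)
        times.append(now)
    return times


if __name__ == "__main__":
    # All values below are illustrative, not the paper's settings.
    cfg = SimulationConfig(cores_per_node=4, num_clusters=8,
                           num_messages=5, message_length_bytes=1024,
                           mean_inter_arrival_s=1e-3)
    print(arrival_times(cfg))
```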
Communication networks in MCMCA are divided into internal-cluster and external-cluster networks, and the communication network latency in the architecture is determined by four factors, including:
1. Average waiting time at the source node
2. Average transmission time for a message to cross the network
The model is parameterised by the number of processors in each cluster, the number of cores on the processors, the number of trees in the MCN, the number of clusters C and the number of ports m.
[Equation (1)]
Messages enter the internal-cluster or the external-cluster network according to the probability of an outgoing request. Messages generated by a source node are sent to the external-cluster network with this probability, and enter the internal-cluster network with the complementary probability.
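The probabilistic split between internal-cluster and external-cluster traffic can be sketched as a Bernoulli trial per message. In the Python illustration below, `p_outgoing` stands for the probability of an outgoing request; the function name and the seeded generator are assumptions made for the example.

```python
import random
from typing import Dict


def split_traffic(num_messages: int, p_outgoing: float, seed: int = 0) -> Dict[str, int]:
    """Count how many messages go to each network.

    Each message is sent to the external-cluster network with
    probability p_outgoing, otherwise to the internal-cluster network.
    """
    if not 0.0 <= p_outgoing <= 1.0:
        raise ValueError("p_outgoing must lie in [0, 1]")
    rng = random.Random(seed)
    counts = {"internal": 0, "external": 0}
    for _ in range(num_messages):
        if rng.random() < p_outgoing:
            counts["external"] += 1
        else:
            counts["internal"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical setting: 30% of requests leave the source cluster.
    print(split_traffic(num_messages=100_000, p_outgoing=0.3))
```

For a large number of messages the external share approaches the chosen probability, which is the behaviour the analytical model relies on when weighting internal and external traffic.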
[Equation (3)]
Two per-packet transmission times are used: one for a node-to-switch connection (or vice versa), and β for a switch-to-switch connection. These times depend on the message length M, the network latency, the switch latency and the transmission time of one byte.
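As a rough illustration of how these quantities combine, the Python sketch below uses a generic latency-plus-bandwidth form: a fixed latency plus the number of bytes multiplied by the per-byte transmission time. This form, the function name and the numeric values are assumptions for illustration and are not necessarily the exact expressions of the model.

```python
def link_time(fixed_latency_s: float, length_bytes: int, byte_time_s: float) -> float:
    """Time to move length_bytes across one connection.

    Assumed form: fixed latency (network or switch latency) plus
    length_bytes times the transmission time of one byte.
    """
    if length_bytes < 0:
        raise ValueError("length_bytes must be non-negative")
    return fixed_latency_s + length_bytes * byte_time_s


if __name__ == "__main__":
    # Hypothetical values: 1 microsecond network latency, 0.5 microsecond
    # switch latency, 1 ns per byte, 256-byte unit of transfer.
    node_to_switch = link_time(1e-6, 256, 1e-9)
    switch_to_switch = link_time(0.5e-6, 256, 1e-9)
    print(node_to_switch, switch_to_switch)
```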
A. Average Message Latency of Internal-Cluster
The communication latency in the internal-cluster network covers messages travelling in the intra-chip network (AC), inter-chip network (EC) and intra-cluster network (ACN). Given the total arrival rate in the internal-cluster network and the average distance each message travels to cross the network, each channel receives messages at a rate of:
The required coefficient to calculate the channel rate of the network is:
Each message may use a different number of channel links to reach its destination. Since this architecture applies to multi-core processors, the total transmission time is based on the number of cores on the processors. Therefore, the average transmission time in the internal-cluster and external-cluster networks can be considered, for a 2j-channel path with j channels in the source node and j channels in the destination node, as:
The probability of a message reaching its destination can be computed by:
The average amount of time a message waits for a channel, given the blocking probability, is:
Since the transmission time of a message at stage s is equal to the message transfer time plus the waiting time at subsequent stages in a channel, it can be expressed as:
[Equation (12)]
The waiting time of a message before entering the network, for a given message arrival rate, can be calculated as:
[Equation (13)]
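The growth of source waiting time with arrival rate can be illustrated with a standard queueing result. The Python sketch below assumes the source queue behaves as an M/G/1 queue and applies the Pollaczek-Khinchine mean waiting time formula; the M/G/1 assumption and the function name are choices made for this illustration and may differ from the expression used in the model.

```python
import math


def mg1_waiting_time(arrival_rate: float, mean_service: float, var_service: float) -> float:
    """Mean time in queue for an M/G/1 queue (Pollaczek-Khinchine).

    W = lambda * E[S^2] / (2 * (1 - rho)), with rho = lambda * E[S]
    and E[S^2] = Var[S] + E[S]^2. Returns infinity at or past saturation.
    """
    if arrival_rate < 0 or mean_service <= 0 or var_service < 0:
        raise ValueError("invalid queue parameters")
    rho = arrival_rate * mean_service
    if rho >= 1.0:
        return math.inf
    second_moment = var_service + mean_service ** 2
    return arrival_rate * second_moment / (2.0 * (1.0 - rho))


if __name__ == "__main__":
    # Hypothetical service time of 1 time unit with zero variance.
    for lam in (0.1, 0.5, 0.9, 0.99):
        print(lam, mg1_waiting_time(lam, 1.0, 0.0))
```

The waiting time rises sharply as utilisation approaches 1, which is the kind of behaviour a saturating latency curve reflects.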
Lastly, the average time for the tail flit to reach its destination in the internal-cluster network is as follows:
The equations for message latency in the internal-cluster communication networks can be expressed as:
B. Average Message Latency of External-Cluster
Messages travelling in an external-cluster communicate via two interconnection networks, the inter-cluster network (ECN) and multi-cluster network (MCN). Thus, the average transmission time for an external-cluster is:
The channel rate for external messages is:
The average time external messages wait for a channel at stage s, given the channel message rate, is:
Since the transmission time of a message at stage s is equal to the message transfer time plus the waiting time at subsequent stages in a channel, it can be expressed as:
[Equation (22)]
Messages generated by the source nodes are sent to the external-cluster network with the probability of an outgoing request, at a given message arrival rate. The waiting time in the external-cluster network can be computed by:
External-cluster messages need to cross transfer switches while traversing the network. The transfer switches act as simple buffers that combine traffic from/to one cluster to/from other clusters. The waiting time at these buffers, for a given message arrival rate, can be computed as:
[Equations (25) and (26)]
The total average latency for the last packet to reach its destination in the external-cluster can be computed by:
Therefore, the equations for message latency in the external-cluster communication networks can be expressed as:
C. Average Message Latency of MCMCA
From equations (15) and (28), the average message latency of communication networks in the multi-core multi-cluster architecture can be obtained by the sum of the message latency in internal-cluster and external-cluster as follows:
V. RESULTS AND DISCUSSION
The simulation experiments and analytical calculations have been conducted with different numbers of cores. The analysis uses the interconnection network parameters in TABLE I with the input parameters in TABLE II. The results of the baseline experiment are produced with a single-core cluster architecture, as depicted in Figure 2. Simulation and analytical analysis reveal that the results obtained from the single-core cluster architecture closely match the results from the model for the multi-cluster architecture presented by Javadi et al. [5]. The results show that as the traffic rate increases, the average communication latency increases, because messages have to wait for resources before travelling into a network. The similarity of the results confirms that the simulation model is a good basis for measuring the communication latency of a large-scale cluster, and that it can be extended to the multi-core multi-cluster architecture.
Analysis has been done with three different numbers of cores in a processor. Figure 3 depicts the simulation and analytical results with a flit size of 256 bytes, while Figure 4 depicts the results with a flit size of 512 bytes. With a flit size of 512 bytes the network starts to saturate under light traffic, whereas the smaller flit size can achieve optimal throughput in heavy traffic. Both figures also show that the traffic starts to saturate at about 50% of traffic rate capacity with single-core and dual-core processors, but a higher throughput is achieved with quad-core processors. This is because of the higher latency occurring in lower traffic, since more messages need to be served in the internal-cluster network. These figures also reveal that the architecture scales well under various configurations.
VI. CONCLUSION
Multi-core cluster architectures have tremendous potential in the future of computing. However, some issues can affect the performance of the architecture and limit the advantages gained. This paper has presented a new architecture for building and measuring the performance of interconnection networks in the Multi-Core Multi-Cluster Architecture (MCMCA) to overcome these issues. The performance analysis showed that the architecture is able to measure the performance of multi-core clusters under various parameters and system configurations. Higher performance of large clusters can be achieved by optimising the interconnection network communication. The results also demonstrate that as the number of cores increases, higher traffic rates are achieved, which contributes to higher throughput. The simulation and analytical calculation results match very closely, indicating that simulation is a good tool for experimentation with various configurations.
