Data flow analysis and optimization are considered for homogeneous rectangular mesh networks. We propose a flow matrix equation that allows a closed-form characterization of the minimal-time solution and of speedup, together with a simple method to determine when and how much load to distribute to each processor. We also give a rigorous mathematical proof that the flow matrix equation has an optimal solution and that this solution is unique. The methodology introduced here is applicable to many interconnection networks and switching protocols (as examples, we examine toroidal and hypercube networks in this paper). An important application is improving chip area and chip scalability for networks on chips processing divisible-style loads.
INTRODUCTION

Background
Networks on chips (NOC) represent the smallest networks that have been implemented to date (Robertazzi 2017). A popular choice for the interconnection network on such chips is the rectangular mesh. It is straightforward to implement and is a natural choice for a planar chip layout. Data to be processed can be inserted into the chip at one or more so-called "injection points", that is, nodes in the mesh that forward the data to other nodes. Beyond NOCs, injecting data into a parallel processor's interconnection network has been done for some time, for instance in IBM's Blue Gene machines (Krevat, Castaños, and Moreira 2002).
In this paper we seek to determine, for a single injection point on a homogeneous rectangular mesh, how to optimally assign load to different processors and links in a known timed pattern so as to process a load of data in a minimal amount of time (i.e., minimize makespan). We present an optimal technique for single-point injection in homogeneous meshes whose complexity is no greater than that of solving a set of linear equations. The methodology presented here can be applied to a variety of interconnection networks and switching/scheduling protocols besides those directly covered in this paper. As examples, toroidal and hypercube networks are also considered. A companion paper examines this problem with multiple sources of load (Zhang 2018).
Crucial to our success is the use of divisible load scheduling theory (Bharadwaj, Ghose, and Robertazzi 2003) (Bharadwaj, Ghose, Mani, and Robertazzi 1996). Developed over the past few decades, it assumes that load is a continuous variable that can be arbitrarily partitioned among processors and links in a network. Use is made of divisible load scheduling's optimality principle (Bharadwaj, Ghose, Mani, and Robertazzi 1996) (Sohn and Robertazzi 1996), which says that makespan is minimized when one forces all processors to stop at the same time (intuitively, otherwise one could transfer load from busy to idle processors to achieve a better solution). This leads to a series of chained linear flow and processing equations that can be solved by linear equation techniques, often yielding recursive and even closed-form solutions for quantities such as makespan and speedup.
In this paper, the use of virtual cut-through switching (Kermani and Kleinrock 1979) and a modified version of store and forward switching is investigated. These are two of the many switching protocols to which the methodology described here applies. In the virtual cut-through environment, a node can begin relaying the first part of a message (packet) along a transmission path as soon as it starts to arrive at the node; that is, it does not have to wait to receive the entire message before it can begin forwarding it. In pure store and forward switching, messages must be completely received before being forwarded.
More specifically, an equivalent processor (and the makespan, speedup and processor load fractions) is first found for a 2 × 2 homogeneous mesh network, and the result is generalized to a homogeneous 2 × n mesh network. After that, we analyze the more general case of a homogeneous m × n mesh network and obtain a general closed-form matrix representation yielding the equivalent processor speed, makespan, speedup and processor/link load allocation. Different positions of the single data injection point, such as the corner, boundary and inner grid, are also discussed. In addition, a rigorous mathematical proof of the existence and uniqueness of the flow matrix solution is presented.
In summary, this work proposes a quantitative flow matrix model that tells one how to assign data fractions to each processor in a homogeneous mesh in a makespan-optimal manner. The complexity of the technique is no greater than that of solving a set of linear equations. This work has relevance to mesh interconnection networks used in parallel processing in general and to meshes used in Networks on Chips in particular. An important application is improving chip area and chip scalability for networks on chips processing divisible-style loads.
Related Work
In pathbreaking work in the 1990s, Drozdowski and others created models and largely recursive solutions for single source distribution in 2D meshes (Błażewicz and Drozdowski 1996), 3D meshes (Drozdowski and Głazek 1999) (Głazek 2003), toroidal meshes (Błażewicz, Drozdowski, Guinand, and Trystram 1999) and hypercubes (Błażewicz and Drozdowski 1995). For 2D meshes (Błażewicz and Drozdowski 1996), recursive solutions and closed-form asymptotic results were found. This was extended to 3D meshes with recursive solutions for load fractions (Głazek 2003) (Drozdowski and Głazek 1999). Recursive solutions for toroidal networks and hypercubes were also found (Błażewicz, Drozdowski, Guinand, and Trystram 1999) (Błażewicz and Drozdowski 1995). The hypercube work included a closed-form expression for speedup in terms of a fundamental load fraction assignment.
Our Contribution
This work is distinct from earlier work in providing matrix-based solutions (created through induction) for 2D meshes, toroidal networks and hypercubes. Different injection point locations in finite 2D meshes (corner, boundary and center) are also considered. Extensive simulation results based on this modeling are presented in (Zhang 2018).
FLOW MATRIX MODEL
Definitions and Assumptions
Definition 1 (Equivalence Computation). Equivalence computation is a technique that combines a cluster of processors into a single processor with equivalent processing capability.
The following assumptions are used throughout the paper:
• Virtual cut-through switching (Kermani and Kleinrock 1979) and store and forward switching are used to transmit the assigned workload between processors.
- Under virtual cut-through switching, a node can relay the beginning bits of a message (packet) before the entire message is received.
- Under store and forward switching, a message must be completely received by a node before it can be relayed to the next node along its transmission path.
• For simplicity, return communication is not considered.
• The communication delays are taken into consideration.
• The times taken by computation and communication are assumed to be linear functions of the data size.
• The network environment is homogeneous, that is, all the processors have the same computation capacity. The link speeds between any two unit cores are identical.
• The number of outgoing ports in each processor is limited.
• Single path communication: data transfer between two nodes follows a single path.
The optimization objective is as follows:
• Equivalence computation: the objective is to partition and schedule the workload among the processors so as to obtain the minimum makespan (finish time).
The minimum time solution is obtained by forcing all processors in the network to stop processing simultaneously. Intuitively, this is because the solution could otherwise be improved by transferring load from busy processors to idle ones (Bharadwaj, Ghose, Mani, and Robertazzi 1996) (Sohn and Robertazzi 1996).
Processor equivalence is discussed in (Robertazzi 1993) and (Liu, Zhao, and Li 2007); figure 1 is an example.
LastName1, LastName2, and LastNameLastAuthor 
Notations
The following notations and definitions are utilized:
• $L$: The workload.
• $D_i$: The minimum number of hops from processor $P_i$ to the injection site of the data load $L$.
• $\alpha_0$: The load fraction assigned to the root processor.
• $\alpha_i$: The load fraction assigned to the $i$th processor.
• $\hat{\alpha}_i$: The load fraction assigned to each processor on the $i$th layer, $i \in 0, \dots, k - 1$.
• $\omega_i$: The inverse computing speed of the $i$th processor.
• $\omega_{eq}$: The inverse computing speed of an equivalent node collapsed from a cluster of processors.
• $r$: The rank of the flow matrix.
• $z_i$: The inverse link speed of the $i$th link.
• $T_{cp}$: Computing intensity constant. The entire load is processed in time $\omega_i T_{cp}$ seconds on the $i$th processor.
• $T_{cm}$: Communication intensity constant. The entire load is transmitted in time $z_i T_{cm}$ seconds over the $i$th link.
• $\hat{T}_f$: The finish time of the whole processor network. Here $\hat{T}_f = \omega_{eq} T_{cp}$.
• $T_f$: The finish time for the entire divisible load solved on the root processor. Here $T_f = 1 \times \omega_0 \times T_{cp}$.
• The finish time for the $i$th processor, $i \in 0, \dots, mn - 1$.
• $\sigma = z T_{cm} / (\omega T_{cp})$: The ratio between the communication speed and the computation speed, $0 < \sigma < 1$ (Bharadwaj, Ghose, Mani, and Robertazzi 1996) (Hung and Robertazzi 2004).
• $\sum_{i=0}^{mn-1} \alpha_i = 1$.
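The hop count $D_i$ and the grouping of processors into layers can be illustrated with a short sketch. It assumes nodes are addressed by (row, column) coordinates, which is not the paper's $P_i$ numbering, and it relies on the fact that in a rectangular mesh without wraparound links the minimum hop count between two nodes is their Manhattan distance. The function names are hypothetical.

```python
from collections import Counter

def hop_distances(m, n, src):
    """Minimum hop count D from the injection node to every node of an m x n mesh.

    Assumption: nodes are indexed by (row, col) and there are no wraparound
    links, so the shortest path length is the Manhattan distance.
    """
    r0, c0 = src
    return [[abs(r - r0) + abs(c - c0) for c in range(n)] for r in range(m)]

def layer_sizes(m, n, src):
    """Number of processors at each hop distance (layer) from the injection node."""
    counts = Counter(d for row in hop_distances(m, n, src) for d in row)
    return [counts[d] for d in range(max(counts) + 1)]

print(layer_sizes(2, 2, (0, 0)))   # corner injection, 2 x 2 mesh  -> [1, 2, 1]
print(layer_sizes(2, 10, (0, 0)))  # corner injection, 2 x 10 mesh
print(layer_sizes(5, 5, (2, 2)))   # inner-grid injection, 5 x 5 mesh
```

For corner injection on a 2 × n mesh the layers have sizes 1, 2, ..., 2, 1, which matches the pattern of $P_0$, then $P_1$ and $P_2$, then $P_3$ and $P_4$ used in the corner-injection analysis.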
In the virtual cut-through environment, a node can begin relaying the first part of a message (packet) along a transmission path as soon as it starts to arrive at the node; that is, it does not have to wait to receive the entire message before it can begin forwarding it.
First we consider the 2 × 2 mesh network, which can be generalized to a 2 × n mesh network. We then analyze an m × n mesh network and obtain a general closed-form matrix representation. Finally, we give a key methodology to address this type of question. In addition, different single data injection positions, such as the corner, boundary and inner grid, are also discussed.
Data Injection on The Corner Processor
2*2 Mesh Network
The load $L$ is injected at the corner processor $P_0$ (figure 2). The whole load is processed by the four processors $P_0$, $P_1$, $P_2$ and $P_3$ together. Processors $P_0$, $P_1$ and $P_2$ start to process their respective load fractions at the same time; $P_1$ and $P_2$ can do so because load is relayed to them in virtual cut-through mode starting at $t = 0$. Because we assume a homogeneous network (in processing speed and communication speed), $\alpha_1 = \alpha_2$, and $P_1$ and $P_2$ stop processing at the same time. Processor $P_3$ starts to work when the transmission of $\alpha_1$ and $\alpha_2$ is complete. That is, links 0-1 and 0-2 are first occupied transmitting load to processors 1 and 2, respectively, and carry load destined for processor 3 only when that is finished.
According to divisible load theory (Bharadwaj, Ghose, and Robertazzi 2003), we obtain the timing diagram shown in figure 3.
In this Gantt-like timing diagram, communication appears above each axis and computation appears below each axis. We assume that all processors stop computing at the same time in order to minimize the makespan (Sohn and Robertazzi 1996).
Based on the timing diagram, we obtain a group of linear equations to find the load fraction $\alpha_i$ assigned to each processor:
Figure 3: The timing diagram for the 2 × 2 mesh network with virtual cut-through; the root processor is $P_0$
The group of equations is represented in matrix form:
The matrix equation is written $A \times \alpha = b$, where $A$ is called the flow matrix. Because of symmetry, $\alpha_1 = \alpha_2$, so $\alpha_2$ is not listed in the matrix equation.
Finally, the explicit solution is:
The simulation result is illustrated in figure 4. The three processors $P_0$, $P_1$ and $P_2$ have the same load fraction, so the curves of $\alpha_0$ and $\alpha_1$ coincide. The figure shows that as $\sigma$ grows, $\alpha_3$ drops. In other words, as the communication speed decreases, less load is assigned to $P_3$. It then becomes more economical to keep the load local on $P_0$, $P_1$ and $P_2$ and not distribute it to other processors. Thus, for slow communication, $\alpha_0 = \alpha_1 = \alpha_2 = \frac{1}{3}$. The inverse speed of a single equivalent processor that can replace the original network is $\omega_{eq}$, given by $\hat{T}_f = 1 \times \omega_{eq} \times T_{cp}$, so that $\omega_{eq} = \alpha_0 \omega$.
For fast communication ($\sigma \approx 0$), the speedup is 4.
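The 2 × 2 corner-injection case can be checked numerically with a minimal sketch. The equations used here are an assumption read off the timing description, not a transcription of the paper's system: $P_0$ and $P_1$ compute from $t = 0$ and stop together, $P_3$ starts after the transmission of $\alpha_1$ (a delay of $\sigma \alpha_1$ in units of $\omega T_{cp}$) and stops at the same time, and the fractions sum to one with $\alpha_2 = \alpha_1$. Under these assumptions the system gives $\alpha_0 = \alpha_1 = 1/(4 - \sigma)$ and $\alpha_3 = (1 - \sigma)/(4 - \sigma)$. The helper names are hypothetical.

```python
from fractions import Fraction

def solve_linear(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (A must be nonsingular)."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [row[-1] for row in M]

def flow_system_2x2(sigma):
    """Assumed flow matrix A and right-hand side b; unknowns are (alpha_0, alpha_1, alpha_3)."""
    A = [
        [1, 2, 1],          # alpha_0 + 2*alpha_1 + alpha_3 = 1 (alpha_2 = alpha_1 by symmetry)
        [1, -1, 0],         # P0 and P1 both start at t = 0 and stop together
        [0, sigma - 1, 1],  # P3 starts after sigma*alpha_1 and stops with P1
    ]
    b = [1, 0, 0]
    return A, b

def allocate_2x2(sigma):
    alpha_0, alpha_1, alpha_3 = solve_linear(*flow_system_2x2(sigma))
    speedup = 1 / alpha_0  # omega / omega_eq, with omega_eq = alpha_0 * omega
    return alpha_0, alpha_1, alpha_3, speedup

# sigma = 0 and sigma = 1 are the limiting cases of fast and slow communication.
for sigma in (Fraction(0), Fraction(1, 2), Fraction(1)):
    print(sigma, allocate_2x2(sigma))
```

Exact rational arithmetic is used so that the limiting cases reproduce the values quoted in the text: a speedup of 4 as $\sigma$ approaches 0, and $\alpha_0 = \alpha_1 = \alpha_2 = 1/3$ as $\sigma$ approaches 1.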
2*n Mesh Network
The 2 × n homogeneous mesh network (figure 5) processes load $L$, which originates at $P_0$.
Load is distributed from $P_0$ to $P_1$ and $P_2$ via virtual cut-through. After $P_1$ and $P_2$ finish receiving their load over links 0-1 and 0-2, they are used to forward load to $P_3$ and $P_4$, and so on.
Similarly to the analysis of figure 3, the timing diagram for figure 5 is shown in figure 6.
Figure 5: 2 × n (n = 10) mesh network with the workload injected at $P_0$
Figure 6: The timing diagram for the 2 × 10 mesh network with data injection at $P_0$ under virtual cut-through
The equations are presented as: 
The flow matrix is shown:
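According to Cramer's rule, the explicit solution of the group of equations is
$$\alpha_i = \frac{\det A^\star_i}{\det A},$$
where $A^\star_i$ is the matrix formed by replacing the $i$-th column of $A$ by the column vector $b$. Specifically, $\alpha_0 = \det A^\star_0 / \det A$.
The equivalent inverse processing speed follows from $\hat{T}_f = 1 \times \omega_{eq} \times T_{cp}$: since $P_0$ processes its fraction $\alpha_0$ until the common finish time, $\omega_{eq} = \alpha_0 \omega$. Finally, the speedup is:
$$\text{Speedup} = \frac{T_f}{\hat{T}_f} = \frac{\omega}{\omega_{eq}} = \frac{1}{\alpha_0} = \frac{\det A}{\det A^\star_0}.$$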
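Cramer's rule as used here can be sketched in a few lines. The example system is illustrative only: it is an assumed 3 × 3 flow matrix of the kind that arises for the 2 × 2 mesh with $\sigma = 1/4$, not the paper's 2 × n matrix, and the function names are hypothetical. Exact fractions are used so the determinants carry no rounding error.

```python
from fractions import Fraction

def det(A):
    """Exact determinant by Gaussian elimination over the rationals."""
    M = [[Fraction(v) for v in row] for row in A]
    n, sign, result = len(M), 1, Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)          # no pivot in this column: singular matrix
        if pivot != col:
            M[col], M[pivot] = M[pivot], M[col]
            sign = -sign                # a row swap flips the sign of the determinant
        result *= M[col][col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return sign * result

def cramer(A, b):
    """Solve A x = b via x_i = det(A*_i) / det(A), where A*_i is A with column i replaced by b."""
    d = det(A)
    if d == 0:
        raise ValueError("singular matrix: Cramer's rule does not apply")
    solution = []
    for i in range(len(A)):
        A_i = [list(row[:i]) + [rhs] + list(row[i + 1:]) for row, rhs in zip(A, b)]
        solution.append(det(A_i) / d)
    return solution

# Illustrative system (assumed): unknowns (alpha_0, alpha_1, alpha_3), sigma = 1/4.
sigma = Fraction(1, 4)
A = [[1, 2, 1], [1, -1, 0], [0, sigma - 1, 1]]
b = [1, 0, 0]
alphas = cramer(A, b)
print(alphas)         # load fractions
print(1 / alphas[0])  # speedup = 1 / alpha_0 = det(A) / det(A*_0)
```

Cramer's rule costs one determinant per unknown, so it is mainly useful for writing the solution in closed form; for larger meshes a single elimination pass solves the same system with less work.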
Further, we prove that the flow matrix satisfies $\det A \neq 0$.
$C$ is a lower triangular matrix whose diagonal elements are nonzero. So $C$ is non-degenerate, that is, its columns are linearly independent.
After a series of column reduction and row reduction operations, we get
a matrix whose columns are still linearly independent. Considering $0 < \sigma < 1$, the flow matrix is full rank, so $\det A \neq 0$. This proof can be generalized to the m × n case.
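The full-rank claim can be spot-checked numerically; this is a sanity check, not a substitute for the proof. The sketch below builds a hypothetical flow matrix for a 2 × n mesh with corner injection under stated assumptions: there is one unknown per layer (the per-processor fraction of layers 0 to n), layers 0 and 1 start at $t = 0$, and layer $k \geq 2$ starts only after the loads of layers 1 to $k - 1$ have been sent over the first link, a delay of $\sigma$ times their sum. The paper's actual matrix may be arranged differently, and the function names are made up for illustration.

```python
from fractions import Fraction

def flow_matrix_2xn(n, sigma):
    """Assumed flow matrix for corner injection on a 2 x n mesh (n >= 2)."""
    size = n + 1                                   # layers 0 .. n
    A = [[Fraction(0)] * size for _ in range(size)]
    # Normalization: layer sizes are 1, 2, ..., 2, 1.
    A[0] = [Fraction(1)] + [Fraction(2)] * (n - 1) + [Fraction(1)]
    A[1][0], A[1][1] = Fraction(1), Fraction(-1)   # layers 0 and 1 finish together
    for k in range(2, size):
        A[k][1] = sigma - 1                        # delay sigma*a_1 minus reference a_1
        for j in range(2, k):
            A[k][j] = sigma                        # delay from layers 2 .. k-1
        A[k][k] = Fraction(1)                      # layer k's own computation
    return A

def rank(A):
    """Exact rank by Gaussian elimination over the rationals."""
    M = [[Fraction(v) for v in row] for row in A]
    rows, cols, r = len(M), len(M[0]), 0
    for col in range(cols):
        pivot = next((i for i in range(r, rows) if M[i][col] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, rows):
            factor = M[i][col] / M[r][col]
            M[i] = [v - factor * p for v, p in zip(M[i], M[r])]
        r += 1
    return r

n = 10
for tenth in range(1, 10):
    sigma = Fraction(tenth, 10)
    print(sigma, rank(flow_matrix_2xn(n, sigma)) == n + 1)
```

Under these assumptions every sampled value of $\sigma$ in (0, 1) gives rank n + 1, in line with the statement that the flow matrix is full rank.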
m*n Mesh Network
Consider a general m × n mesh network, such as those in figure 7 and figure 1. Using the previous methodology, we obtain the flow matrix equations for figure 7:
Also, the flow matrix equations for figure 1:
We use a similar method to prove $\det A \neq 0$. The equivalent inverse processing speed follows from
$\hat{T}_f = 1 \times \omega_{eq} \times T_{cp}$ and $\omega_{eq} = \alpha_0 \omega$, so the speedup is $T_f / \hat{T}_f = \omega / \omega_{eq} = 1 / \alpha_0$. The number of rows of the flow matrix is the number of distinct types of processor load fraction.
After investigating these cases, we find a crucial fact:
OTHER NETWORKS
Mesh Networks
CONCLUSION
In this work a significant problem is addressed: optimal single source load distribution in mesh, toroidal and hypercube networks. This is done by way of example for virtual cut-through switching and a modified version of store and forward switching. However, the approach outlined here is applicable to a wide variety of switching/load distribution strategies and architectural parameters. We propose a flow matrix equation method to characterize the minimal time solution, together with a simple method to determine when and how much load to distribute to processors. This work demonstrates that mathematical modeling and tractable solutions can be an aid to designing and evaluating scheduling strategies in parallel systems. Parallel systems will be with us for some time, so this and related work is likely to be of enduring value.
