The mesh is a popular multi-computer topology because of its simplicity and because it needs only a few connections per node, regardless of the size of the system. However, one-to-all communication (broadcasting) and point-to-point communication between two distant nodes both incur long delays.
Introduction
In recent years, advances in VLSI and computer networking technologies have made it attractive to build multi-computer systems for various applications. Mesh-connected multi-computers (MCCs) are widely used for VLSI implementation because of their structural simplicity and regularity. For example, the interconnection area for each processor is fixed regardless of the mesh size. The regular interconnection pattern of the mesh topology makes MCCs suitable for solving problems related to matrix manipulation and image processing. In multi-computers, however, some processors may be heavily loaded while others are left idle. Dynamic load distribution is the problem of evenly distributing workload among physically dispersed processors at run time. If the execution times of tasks were known in advance, the tasks could be distributed evenly over all processors to minimize the completion time. However, the execution times of tasks in most applications are not known in advance. Therefore, at run time, load distribution is carried out through task migration: the transfer of tasks from highly loaded processors to idle processors. To decide how to perform task migration, information about the remaining workload of processors must be communicated.
Load distribution can be classified into two classes known as load sharing and load balancing [8]. Load sharing strives to avoid a situation in which some processors remain idle while others are busy. Load balancing also strives to avoid the same situation but goes a step beyond load sharing by attempting to equalize the loads for all processors.
MCCs have a communication drawback in that each processor is connected to at most four local processors. In dynamic load distribution, this local connectivity makes it difficult to exchange information about the remaining workload and results in long communication delays when tasks are migrated between two processors far from each other. Hence, it is desirable to enhance the communication capability of MCCs for dynamic load distribution. The mesh with a global bus as a multi-computer structure is an effective way to do so. We also show that the mesh with a global bus has many salient properties, such as a small diameter (maximum distance between any two processors), a relatively small degree (number of links connected to a processor), a small average distance between processors, suitability for broadcasting, and a small initial data distribution time. These properties are better than those of the mesh, the hypercube, mesh variants, or hypercube variants.
A dynamic load distribution algorithm for the mesh with a global bus is also proposed in this paper. On the mesh, dynamic load distribution algorithms usually use only local information about neighbor processors or links, because global information is only available after a long delay. However, on the mesh with a global bus, one-to-all communication (broadcasting) can easily be achieved through the bus. That is, sharing information among all processors is easily implemented by using the bus, which is intrinsically good for broadcasting. Moreover, on the mesh with a global bus, long-distance task migration can be completed quickly using the bus.
Several studies of dynamic load distribution have been reported [8] [9] [10] [11] [12] [13]. Frank and Robert [10] presented the Gradient Model, which employs a gradient map of the proximity to underloaded processors to guide the migration of tasks from overloaded to underloaded processors. Jian and Hwang [9] used a front-end host machine (called an information collector) as a supervisor. This method assumes that the supervisor is logically connected to all processors. The supervisor collects information from all processors and then makes a decision on task migration. Pallab [12] presented an adaptive dynamic load balancing algorithm for distributed systems. In Pallab's method, a threshold parameter, which is used to determine whether a processor is overloaded, is dynamically adapted to the available communication bandwidth. Marc and Anthony [11] compared five dynamic load balancing methods: the Gradient Model, Dimension Exchange, Hierarchical Balancing, Sender Initiated Diffusion, and Receiver Initiated Diffusion. The Dimension Exchange and Hierarchical Balancing methods were designed for the hypercube and tree structures. On the mesh, according to Marc and Anthony [11], Receiver Initiated Diffusion (RID) is comparatively superior to the other methods. In the RID, whenever a processor is underloaded, it requests task migration from its neighbor processors with a specified amount of load difference. The load difference is the difference between the number of tasks in a neighbor processor and the average number of tasks in the requesting processor and its neighbor processors. When it receives a request, a neighbor processor fulfills the request only up to an amount equal to half of its tasks. The RID method is a near-neighbor diffusion approach which employs overlapping local balancing domains to achieve global load balancing. Our proposed method is similar to those of Jian [9] and Pallab [12].
Instead of a host machine, as in Jian's method, we use the bus as the information collector because the bus can be physically connected to all processors and supports fast communication between processors located far from each other. In Pallab's method, the bus is used frequently, for example for collecting load information. In our method, by contrast, the bus is used only occasionally, for special communication purposes. For the bus to efficiently support dynamic load distribution, the bus control logic must have additional functions. In this paper, we design the bus control logic for dynamic load distribution, and investigate the issues involved in implementing a bus on the mesh, such as cost, scalability, bus communication time, and bus contention.
This paper is organized as follows. In Section 2, the mesh with a global bus is proposed as a multi-computer topology, it is compared with other structures, and the issues in implementing a bus on a mesh are investigated. The proposed dynamic load distribution algorithm and the bus control logic to support the proposed algorithm are presented in Section 3. Section 4 presents the simulation results of the proposed algorithm and compares them with those of the RID. Section 5 summarizes and concludes the paper.
Mesh with a Global Bus
To enhance the communication capability of traditional topologies, several modified structures such as the pyramid [1], mesh-of-trees [2], the mesh with reconfigurable bus [3], the mesh with multiple buses [4] [5] [6], and the hypercube with multiple buses [7] have been proposed. The pyramid, mesh-of-trees, and the mesh with reconfigurable bus systems were designed for array processors, but are not suitable as multi-computer topologies since they either have many connections or are irregular.
In the case of the mesh with multiple buses (MMB), processing elements in every row and in every column of the mesh are connected to a horizontal bus and a vertical bus, as shown in Figure 1(a). Each processing element (PE), which consists of a processor and an I/O module, has two additional ports to connect to the two buses. Similarly, in the hypercube with multiple buses (HMB), the PEs in a subset are connected to a bus while the original hypercube links are retained, as shown in Figure 1(b). Partitioning of all PEs into subsets is achieved by using coding theory. These structures improve the point-to-point communication delay between two processors located far from each other.
The mesh with a global bus (MGB) is shown in Figure 2(a). All PEs on the mesh and the host computer are connected to a global bus. Each PE requires one additional port to be connected to the bus. All PEs on the MGB can simultaneously read information being broadcast through the bus. By using the bus on the MGB, global information sharing by all PEs is feasible in one bus communication step, that is, the period in which a PE gets the bus grant and then transmits its data to the destination PE (or PEs). Also, the MGB can support communication between two distant processors at less cost than the MMB or HMB.
To compare these structures, we evaluate properties of the hypercube, the mesh, the HMB, the MMB, and the MGB from the viewpoints of diameter, degree, average distance between PEs, and broadcasting steps (number of communication steps required for broadcasting). The properties of the five structures are summarized in Table 1, where N represents the number of PEs in the system. Since each PE in the mesh has different properties, depending on the shape of the mesh and its position, we select the properties of the PE in the center of the square mesh for simplicity.
Also, several values (diameter, average distance, broadcasting) in the HMB's properties are approximated. For a large N, the MGB is better than the other structures with respect to all properties, excluding the degree. Though the degree of the MGB is higher by one than that of the mesh, the other properties of the MGB are much better than those of the mesh.
Also, initial data distribution from the host computer to all PEs on the MGB, shown in Figure 3(a), is more efficient than that of the mesh, shown in Figure 3(b). In a mesh, the host computer may be connected to PEs on the boundary which have spare links. Then the minimum initial data distribution time is (communication from the first column to the last + communication from the middle of a row to the end) × link communication time. On the MGB, where the host computer reaches all PEs over the bus, initial data distribution is faster than on the mesh. Result collection from all PEs to the host computer is likewise much faster on the MGB than on the mesh.
The MGB has good scalability since its size can easily be increased while retaining its regular structure. In practice, bus communication time is proportional to log P, where P is the number of processors connected to the bus [5, 14]. However, if the bus is operated at a high speed, the bus communication delay after acquiring the bus is negligible. Therefore, it is reasonable to assume that bus communication time is constant. For example, the AP1000 multi-computer [17] manufactured by Fujitsu is composed of three independent networks similar to the MGB: a bus, torus links, and a synchronization network. In the AP1000, the speeds of the bus and links are 50 MB/s and 25 MB/s, respectively, and the number of processors can be extended to 1024.
The drawback of the MGB is the bus contention problem. Because only one PE is allowed to use the bus at any given time, when two or more PEs try to use the bus at the same time, all PEs except the one with the bus grant must wait until the granted PE finishes its use.
As more PEs try to use the bus simultaneously, their waiting time increases accordingly. In our proposed scheme on the MGB, the bus contention problem can be diminished by reducing the frequency with which the bus is used or by using a high-speed bus. The bus is used only for special communication purposes such as initial data distribution, broadcasting to request load distribution, and task migration, which will be discussed in Section 3.
3 Dynamic Load Distribution on Meshes with Broadcasting
Dynamic Load Distribution Algorithm
When the number of processors is m, the parallel system can be regarded as m M/M/1 queuing systems. The proposed dynamic load distribution model on an MGB parallel system is shown in Figure 4, where $\lambda(i)$ is the arrival rate of tasks and $\mu(i)$ the departure rate of tasks in processor i. In the proposed dynamic load distribution method, processors communicate through the mesh links to reduce bus congestion. Through the bus, a request for load distribution is broadcast and task migration is performed after one processor is selected through bus arbitration (the bus control logic selects the processor with the highest priority among those trying to use the bus).
On the MGB, dynamic load distribution can be achieved by two methods: a sender-initiated method and a receiver-initiated method. In the sender-initiated method (SIM), whenever a processor (or PE) becomes heavily loaded (the number of tasks is above a certain threshold), it broadcasts a load-sharing request. After receiving this request, one of the idle processors responds to it. Then one or more tasks are migrated from the heavily loaded processor to the idle processor using the bus. This procedure requires three bus-communication steps.
In the receiver-initiated method (RIM), whenever a processor becomes idle (the number of tasks is zero), it broadcasts a request message. After receiving this request, the most heavily loaded processor (the one with the largest number of tasks) sends a task to the idle processor through the bus. This procedure requires two bus-communication steps. For dynamic load distribution on the MGB, we selected RIM rather than SIM since it takes two bus-communication steps while SIM takes three. We refer to the dynamic load distribution algorithm based on RIM as the Receiver Broadcasting Algorithm (RBA), formally described below:
Receiver Broadcasting Algorithm (RBA)
Stage 1. An idle processor broadcasts a load-sharing request message.
Arbitration: When a processor becomes idle, it joins in bus arbitration with priority 0.
Transmission: After getting the bus grant through bus arbitration, it broadcasts the request message and its ID.
Stage 2. The most heavily loaded processor sends a task to the idle processor.
Arbitration: After receiving a load-sharing request message, all processors check the number of tasks they have. Every processor that has more than one task joins in bus arbitration with its task count as its request priority.
Transmission: After bus arbitration, the processor with the highest number of tasks gets the bus grant, and then sends a task to the idle processor that broadcast the initial load-sharing request message.
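The two stages above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `rba_step`, the list-of-task-counts representation, and the lowest-index tie-break are assumptions made for the example, and bus arbitration is modeled simply as picking the largest task count.

```python
from typing import List, Optional


def rba_step(task_counts: List[int], idle_id: int) -> Optional[int]:
    """Model one RBA round for an idle processor.

    Returns the index of the processor that sent a task, or None if no
    processor holds more than one task. The list is updated in place.
    """
    # Stage 1: the idle processor wins arbitration with priority 0 and
    # broadcasts its request and ID (modeled implicitly by this call).
    if task_counts[idle_id] != 0:
        raise ValueError("only an idle processor broadcasts a request")

    # Stage 2: every processor holding more than one task joins
    # arbitration, using its task count as the priority.
    candidates = [i for i, n in enumerate(task_counts) if n > 1]
    if not candidates:
        return None  # nobody responds to the request

    # The highest task count wins; ties go to the lowest index
    # (an assumption of this sketch, not stated in the text).
    sender = max(candidates, key=lambda i: (task_counts[i], -i))
    task_counts[sender] -= 1
    task_counts[idle_id] += 1
    return sender
```

For instance, with task counts `[5, 3, 2, 0]` and the last processor idle, one call moves a single task from the first processor to the idle one.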
The proposed dynamic load distribution algorithm is illustrated in Figure 5. For simplicity, we use a small system consisting of four processors. Whenever a processor (e.g., processor 4) becomes idle after finishing its own tasks, it joins in bus arbitration with priority 0, as shown in Figure 5(a). After an idle processor gets the bus grant, it broadcasts a load-sharing request message and its ID to all processors, as shown in Figure 5(b). At this point, Stage 1 (the first bus-communication step) is finished. In Stage 2, processors whose task count is greater than one (processors 1, 2, 3) join in bus arbitration with priority equal to their respective task counts, as shown in Figure 5(c). After bus arbitration, the processor with the highest priority (processor 1) gets the bus grant and sends a task to the idle processor which broadcast the initial load-sharing request message (processor 4), as shown in Figure 5(d).
This algorithm does not require an information process to collect the load information of other processors, a process that incurs heavy communication traffic. Previous algorithms need the information process to decide how to perform task migration. In the proposed algorithm, it can be eliminated because the task migration decision is made through bus arbitration.
By using the bus, this algorithm migrates a task quickly between processors located far from each other. Although task migration between distant processors can be carried out indirectly by several task migration steps through neighboring processors, this generates a long delay and a lot of communication traffic. By using the bus for direct task migration, the idle processor can be utilized promptly.
This algorithm always migrates a task from the most heavily loaded processor to the idle processor. Whenever task migration occurs, the current number of tasks in each processor is used as the priority in bus arbitration, so the most heavily loaded processor is selected. Inappropriate task migration due to obsolete information about the number of tasks in some processors is thereby avoided.
Bus Control Logic for Receiver Broadcasting Algorithm
To support the RBA on an MGB, an asynchronous bus control logic design is required. In order to support the RBA efficiently, we designed the bus control logic based on Futurebus [15, 16]. The asynchronous bus control logic consists of several lines connected to an open-collector output (wired-OR) and bus control lines connected to each PE. We denote the control lines from the PE to the bus by lower-case letters, and the bus lines by the corresponding capital letters. The control lines used are described as follows:
R: line indicating that the bus is requested. Any PE requesting the bus sets its r to 1. Then R becomes 1 if the r of any PE is 1, because it is a wired-OR.
X, Y, Z: lines used for synchronization. X (Y, Z) has the value 0 only when the x (y, z) of every PE is 0.
AB(n): n-bit priority lines used for bus arbitration. In the load distribution algorithm, they denote the number of tasks at each node.
g: bus grant line. After bus arbitration, it becomes 1 if the bus is granted and is 0 otherwise.
The flow diagram of our bus control logic is shown in Figure 6. This flow consists of three consecutive operations which are synchronized among all PEs and invoked repeatedly when using the bus. In Op1, each PE that wants to use the bus sets its control line r to 1. In Op2, all requesting PEs compete using their priorities ab(n), and the g value of the PE with the highest priority becomes 1 through the arbitration logic (explained later), so that only one PE gets the bus grant. In Op3, if a PE gets the bus grant, it broadcasts the load-sharing request message or sends a task to the idle processor. If a PE is competing for the bus in order to send a load-sharing request message and does not get the bus grant through bus arbitration, it competes again in the next arbitration, as shown in Figure 6(a). If a PE competing to respond to the load-sharing request message does not get the bus grant, it cancels its bus request, as shown in Figure 6(b).
The synchronization of these three operations among all PEs, without the use of a clock signal, is accomplished by using three wired-OR lines (X, Y, Z). The synchronization process is explained in Figure 7. Initially, x, y, and z in each PE, and therefore X, Y, and Z, have values of 1, 0, 1, respectively. After each PE has finished its Op1, it sets its z to 0; Z then becomes 0 after all PEs have finished Op1. The length of Op1 is therefore determined by the slowest PE. After completion of Op1, Y is set to 1 (Y ← 1), that is, the y of each PE is also set to 1. Op2 and Op3 for each PE are synchronized in the same manner.
The state transition diagram for the bus control logic is constructed from the states of the R, X, Y, Z control lines, and is shown in Figure 8. A state transition occurs by checking the values of the control lines following completion of one operation by all PEs. In Figure 8, a label A/B near an arrow indicates that A is the input value of the state transition, and B the output value.
Bus arbitration must be completed within a brief constant time, regardless of the number of processors participating. Taub [15, 16] presented an efficient asynchronous bus arbitration logic, as shown in Figure 9. In Figure 9(a), a one-bit logic circuit is described and in Figure 9(b), its input-output relationship is described. The value of out is 1 if in is 1 and ab is equal to or larger than Bus, and 0 otherwise. The value of Bus is changed from 0 to 1 when in is 1 and ab is greater than Bus. In other words, when in is 1 and ab is equal to or greater than Bus, out is set to 1 and Bus is set to the OR of ab and Bus. Taub's arbitration logic is described in Figure 9(c). Using this logic, when r is 1, the ab(n) of each PE is compared with AB(n) from the most significant to the least significant bit. Then, after the settling time, the PE whose ab(n) has the highest value sets AB(n) to its own value and its g becomes 1. The g of every PE with a lower ab(n) becomes 0 because its value is smaller than that of AB(n).
Taub's arbitration logic [15, 16] is designed only to resolve requests with different priorities. In the RBA, ab(n) represents the number of tasks in each PE using n lines ($0 \le ab(n) < c \cdot 2^n$, where c is the interval size); numbers above $c \cdot 2^n$ are represented as $c \cdot 2^n - 1$. PEs with task counts in the same c-interval range ($0$ to $c - 1$, $c$ to $2c - 1$, ..., $c(2^n - 1)$ to $c \cdot 2^n - 1$) have the same priority. Therefore, when processors respond to a load-sharing request message, multiple PEs may get the bus grant when they have the same priority value. We solve this problem with an additional logic, as shown in Figure 10.
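The bit-serial comparison and the interval-based priority encoding described above can be illustrated with a short Python sketch. It is a behavioral model under stated assumptions, not a description of the Futurebus circuit: the names `quantize` and `arbitrate` are hypothetical, the wired-OR bus is modeled with a logical OR per bit, and only requesting PEs (r = 1) are passed in.

```python
from typing import List, Tuple


def quantize(task_count: int, c: int, n_bits: int) -> int:
    """Map a task count to an n-bit priority; counts in the same
    c-wide interval share a priority, and large counts saturate."""
    return min(task_count // c, (1 << n_bits) - 1)


def arbitrate(priorities: List[int], n_bits: int) -> Tuple[int, List[bool]]:
    """Resolve priorities from the most to the least significant bit.

    Returns the value settled on the bus and one grant flag per PE.
    More than one flag is True when the highest priority is shared.
    """
    competing = [True] * len(priorities)
    bus = 0
    for bit in range(n_bits - 1, -1, -1):
        # Each PE still competing drives its own bit onto the line.
        driven = [(p >> bit) & 1 if alive else 0
                  for p, alive in zip(priorities, competing)]
        bus_bit = int(any(driven))  # wired-OR of all driven bits
        bus |= bus_bit << bit
        # A PE that drove 0 while the line reads 1 has lost and withdraws.
        competing = [alive and d == bus_bit
                     for alive, d in zip(competing, driven)]
    return bus, competing
```

With c = 4 and n = 3, task counts of 5 and 7 both map to priority 1, so both PEs keep their grant in this model, which is the tie that the additional logic has to resolve.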
To select one PE among those with the same highest priority value, we propose a new arbitration logic in which m ID lines are appended to the n priority lines when the number of PEs is $2^m$, as shown in Figure 10(a). Because the IDs of all PEs are different from each other, only one PE gets the bus grant. However, the size of this arbitration logic grows with the number of processors. To reduce the cost of the logic, we modify the logic circuit, as shown in Figure 10(b). The modified logic requires a small fixed number of gates (in the dotted square) and one additional control line G (compared to the original logic) regardless of the number of processors. The input-output relationship with the added logic is described in Table 2. When one PE has the highest priority, its g becomes 1.
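Appending the ID lines to the priority lines amounts to arbitrating on the concatenated bit string, which a few lines of Python can illustrate. The helper names `arbitration_key` and `select_winner` are hypothetical, and the sketch models only the outcome of the arbitration with appended ID lines, not the reduced-gate circuit with the extra control line G.

```python
from typing import Dict


def arbitration_key(priority: int, pe_id: int, id_bits: int) -> int:
    """Concatenate the priority bits (high part) with the ID bits (low part)."""
    if not 0 <= pe_id < (1 << id_bits):
        raise ValueError("pe_id must fit in id_bits")
    return (priority << id_bits) | pe_id


def select_winner(priorities: Dict[int, int], id_bits: int) -> int:
    """Return the ID of the single PE with the largest concatenated key.

    `priorities` maps each requesting PE's ID to its priority value.
    Because IDs are unique, the keys are unique and exactly one PE wins.
    """
    return max(priorities,
               key=lambda pe: arbitration_key(priorities[pe], pe, id_bits))
```

In this model, among PEs that share the highest priority, the one with the largest ID wins, since the ID occupies the low-order bits of the key.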
In the modified arbitration logic, when the g of a PE is 1, the PE checks the value of G. If G is 0 (no PE has obtained the bus grant yet), its $g^+$ becomes 1 and G is set to 1. If G is 1 (one PE already has the bus grant), its $g^+$ becomes 0.
The proposed RBA method is evaluated using simulation. In doing so, we use "simpack", a simulation tool written in C [18]. Our queuing model for the simulation of a system consisting of four processors is shown in Figure 11. In this model, links and buses as well as processors are represented as queues. The service time of elements in a link or bus queue is constant (link or bus communication time) but that of a processor queue is exponentially distributed (task execution time). The RID simulation model is shown in Figure 11(a) and the RBA simulation model is shown in Figure 11(b).
Evaluation for Static Tasks
We use Speedup as the measure for comparing the performance of various load-sharing strategies given a fixed set of static tasks. Tasks are defined as static if all are available at the beginning (no task arrives at run time). The speedup is the ratio of the execution time on one processor to the execution time on multiple processors:
$$\text{Speedup} = \frac{\text{sequential execution time } (T_s)}{\text{parallel execution time } (T_p)}$$
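As a worked illustration of the Speedup measure, the following Python sketch computes it for the simplest case in which no tasks are migrated. The function name `speedup` and the list-of-lists input are assumptions made for the example; it takes $T_p$ to be the time at which the most heavily loaded processor finishes and ignores all communication cost.

```python
from typing import List


def speedup(tasks_per_processor: List[List[float]]) -> float:
    """Speedup = T_s / T_p when every task stays on its initial processor.

    `tasks_per_processor[i]` holds the execution times of the tasks
    initially assigned to processor i.
    """
    # T_s: a single processor executes every task back to back.
    t_s = sum(sum(tasks) for tasks in tasks_per_processor)
    # T_p: the run ends when the most heavily loaded processor finishes.
    t_p = max(sum(tasks) for tasks in tasks_per_processor)
    return t_s / t_p
```

For example, three processors holding task times `[100, 100]`, `[50]`, and `[150]` give $T_s = 400$ and $T_p = 200$, so the speedup is 2 even though three processors are available; load distribution aims to close that gap.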
The execution time of tasks is exponentially distributed with a mean of 100, 100 tasks are initially assigned to each processor, and a 10 × 10 mesh is assumed. The results are summarized in Table 3, where RID (Receiver Initiated Diffusion) is the dynamic load distribution method regarded as the best to date [11]. Parallel execution time ($T_p$) is the time from the beginning until the most heavily loaded processor finishes execution. To reduce the idle time of processors with no task during load distribution, the RBA is modified to RBA*, in which processors broadcast a request message when the number of tasks is 1 rather than 0. Table 3 shows that the Speedup of the RBA is better than those of the RID and the No-Dist, but somewhat worse than that of the RBA*. Also, the number of task migrations (TM) of the RBA* is smaller than that of the RID, but larger than that of the RBA. Because task migration generates communication and affects the locality of tasks, it is also important to reduce TM while improving Speedup. From the viewpoint of TM, the RBA is slightly better than the RBA* and much better than the RID. These results show that either the RBA or the RBA* is superior to the RID.
To evaluate the scalability of the proposed method, the number of processors in the simulation was also varied. Figure 12 shows that the Speedup of the RBA* method is near that of the Optimal method, is superior to that of the RID regardless of the number of processors, and increases linearly as the number of processors increases. Also, bus communication time in the simulation was varied to check its effect. The line RBA** in Figure 12 represents the simulation result obtained by setting the bus communication time to four times that of the RBA*. Although the bus of the RBA** is four times slower than that of the RBA*, the Speedup of the RBA** is very close to that of the RBA*. That is, bus communication time does not have much influence on the performance of the proposed dynamic load distribution algorithm for static tasks. This implies that the increased bus communication time caused by bus contention does not significantly influence the performance of the proposed algorithm.
The total number of tasks is the number of processors multiplied by 100, and the total run time ends when the last task has been completed. The inter-arrival time of dynamic tasks has an exponential distribution, and the arrival rate of tasks in the ith processor is denoted $\lambda_i$ (the mean of the inter-arrival times of tasks is $1/\lambda_i$). When the number of processors is m, the average arrival rate of all processors is denoted $\lambda$. For a given $\lambda$, the $\lambda_i$ are randomly selected. The execution time of dynamic tasks is exponentially distributed, as in the case of the static tasks, with a mean of 100, which is $1/\mu$ when the departure rate (completion rate) of tasks is $\mu$.
Optimal load distribution for dynamic tasks can be regarded as an M/M/m queuing system. The average utilization of an M/M/m queuing system is $\frac{\lambda}{m\mu} \times 100$ when $\lambda < m\mu$, and 100 otherwise [19]. The average utilization and utilization variance of 100 processors are shown in Figure 13(a) and (b), respectively. In Figure 13, the utilization of the RBA is more balanced and better than that of the RID. Also, the utilization of the RBA** (bus communication time is four times slower) is almost equal to that of the RBA. This implies that, for dynamic tasks, the RBA is not influenced by the increase in bus communication time due to bus contention. For static tasks, the speedup of the RBA* is somewhat better than that of the RBA, while the TM of the RBA is smaller than that of the RBA*. However, for dynamic tasks, the utilization of the RBA is somewhat better than that of the RBA*. Also, when compared to the RBA*, the TM of the RBA is larger when the arrival rate is low and smaller when the arrival rate is high, as shown in Table 4. In Table 4, the TM of the RBA, RBA*, and RBA** are smaller than that of the RID. Bus utilization as the arrival rate of tasks is changed is shown in Table 5. Bus utilization is highest when the arrival rate is 1 (when $\lambda / m\mu$ is 1) since at this point task migration occurs most frequently, as shown in Table 4. Table 5 shows that bus utilization is determined by bus communication time rather than by the arrival rate. To evaluate the scalability of the RBA for dynamic tasks, the number of processors was varied; $\mu$ is varied accordingly for $m\mu$ to be 1, while $\lambda$ remains 1. Figure 14 shows that the utilization of the RBA is close to that of an M/M/m queuing system, is more balanced and better than that of the RID, and does not vary much when the number of processors is increased. Also, the utilization of the RBA** is almost equal to that of the RBA.
This implies that the utilization of the RBA for dynamic tasks is influenced neither by an increase in the number of processors nor by an increase in bus communication time due to bus contention. To check bus contention when the number of processors is increased, the bus utilizations of the RBA and the RBA** for different numbers of processors are recorded in Table 6. Table 6 shows that bus utilization is determined by bus communication time rather than by the number of processors. Although the number of processors is increased, bus utilization remains below a certain threshold. Therefore, we can infer that bus utilization is not influenced by an increase in the number of processors.
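The M/M/m reference utilization used as the optimal baseline in this comparison is simple enough to write out as a small Python function. The name `mmm_utilization` and its parameters are assumptions for illustration; `arrival_rate` stands for the aggregate arrival rate to the system and `service_rate` for the per-processor completion rate, following the usual M/M/m convention.

```python
def mmm_utilization(arrival_rate: float, service_rate: float, m: int) -> float:
    """Average utilization, in percent, of an M/M/m queuing system.

    Utilization grows linearly with the offered load and saturates at
    100 once the arrival rate reaches the total service capacity.
    """
    if arrival_rate < 0 or service_rate <= 0 or m <= 0:
        raise ValueError("rates must be non-negative and m positive")
    offered_load = arrival_rate / (m * service_rate)
    return min(offered_load, 1.0) * 100.0
```

With 100 processors and a mean task execution time of 100 (a service rate of 0.01), an aggregate arrival rate of 0.5 corresponds to a utilization of about 50 percent, and any arrival rate of 1 or more saturates the system.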
We also evaluated the performance of the proposed algorithm when the workloads of processors are severely unbalanced. In this simulation, tasks arrive at only a few processors when the total number of processors is 100 and $\lambda$ is 1. No task arrives at the remaining processors, which then become idle. If the number of processors at which tasks arrive is small, the workload difference among processors becomes large. Figure 15 shows performance according to the number of processors at which tasks arrive. In Figure 15, the RBA and RBA** are much better than the RID. As the workloads of processors become severely unbalanced, the performances of the RBA and RBA** remain strong while the performance of the RID becomes poor.
Figure 15: Utilization for the number of processors receiving tasks
Conclusions
To efficiently support dynamic load distribution with enhanced communication capability, we propose the mesh with a global bus as a multi-computer topology. This structure has better properties (a small diameter, a relatively small degree, a small average distance, and suitability for broadcasting) than the mesh, the hypercube, mesh variants, or hypercube variants. Also, initial data distribution in the MGB is much faster than that of the mesh. Implementation of a bus on the mesh can be achieved efficiently with low cost and regularity. Bus contention, the drawback of the MGB, is kept low by using the bus only for special-purpose communications such as load distribution requests, task migrations, and initial data distribution.
We presented two dynamic load distribution methods (RIM, SIM) for the MGB and selected RIM because of its relatively better execution time. In the proposed dynamic load distribution algorithm (RBA) based on RIM, whenever a processor becomes idle, it broadcasts a load distribution request message. At that point the most heavily loaded processor is selected through bus arbitration and sends a task to the idle processor. To efficiently support the proposed algorithm, we designed an asynchronous bus control logic and proposed a bus arbitration logic to select one processor from among multiple competing processors. Using simulation with "simpack", we have shown that the proposed dynamic load distribution algorithm is superior to the RID method, previously known as the best on the mesh, regardless of the number of processors. The proposed algorithm shows better total execution time of tasks and better processor utilization with a smaller number of task migrations. Also, we have found that the bus communication time (and, indirectly, bus contention) does not have much influence on the performance of the proposed dynamic load distribution algorithm. Even when the workloads of processors become severely unbalanced, the proposed algorithm consistently performs well. Finally, in order to increase the communication capability further, we are currently investigating other structures such as the mesh with multiple global buses and the hypercube with multiple global buses.
