Abstract
Introduction
The scheduling of parallel jobs has long been an active area of research [7, 8]. It is a challenging problem because the performance and applicability of parallel scheduling algorithms are highly dependent upon factors at different levels: the workload, the parallel programming language, the operating system (OS), and the machine architecture.
Time-sharing scheduling algorithms are particularly attractive because they can provide good response time without migration or predictions on the execution time of the parallel jobs. However, to achieve good performance, time-sharing algorithms require communicating processes to be scheduled simultaneously. This is a critical problem because the software communication overhead and the scheduling overhead to wake up a sleeping process dominate the communication time on most parallel machines.
In recent years, researchers have developed parallel scheduling algorithms that can be loosely organized into three main classes, according to the degree of coordination between processors: explicit coscheduling, local scheduling and implicit or dynamic coscheduling.
At one end of the spectrum, explicit coscheduling [6] ensures that the scheduling of communicating jobs is coordinated by creating a static global list of the order in which jobs should be scheduled and then requiring a simultaneous context-switch across all processors. Unfortunately, this approach is neither scalable nor reliable. Furthermore, it requires that the schedule of communicating processes be precomputed, thus complicating the coscheduling of applications and requiring pessimistic assumptions about which processes communicate with one another. Lastly, explicit coscheduling of parallel jobs also adversely affects performance on interactive and I/O-based jobs [13].
At the other end of the spectrum is local scheduling, where each processor independently schedules its processes. While this approach is attractive due to its ease of construction, the performance of fine-grain communicating jobs is severely impacted because scheduling is not coordinated across processors [10] .
An intermediate approach developed at UC Berkeley and MIT in recent years is implicit or dynamic coscheduling [1, 5, 15, 20] . With implicit coscheduling, each local scheduler makes independent decisions that dynamically coordinate the scheduling actions of cooperating processes across processors. These actions are based on local events that occur naturally within communicating applications. For example, on message arrival, a processor speculatively assumes that the sender is active and will likely send more messages in the near future. The implicit information available for implicit coscheduling consists of two inherent events: response time and message arrival [1] .
The programming model used in the implementation of implicit coscheduling does not support a full-fledged communication library such as MPI and considers only three basic communication operations: reads and writes, request-response messages between pairs of processes that require the requesting process to wait for the response, and barriers to synchronize all processes.
The limitations of the above localized flow-control strategy emerge when processes perform continuous reads or writes in an irregular communication pattern, e.g., they can flood the output buffers with write operations [1]. Some of these limitations are addressed in [14] with a technique called periodic boost. Instead of sending an interrupt for each incoming message, the kernel periodically examines the status of the network interface, thus reducing the overhead under high communication workloads. Our methodology is based on a similar buffering technique which is integrated with global time-slicing and a strobing algorithm.
The rest of the paper is organized as follows. Section 2 provides the motivation for our buffered coscheduling methodology. The methodology itself is described in Section 3 and some preliminary results are presented in Section 4. Finally, we present our conclusions in Section 5.
Motivation
Figure 1 shows the global processor and network utilization (i.e., the number of active processors and the fraction of active links) during the execution of a transpose FFT algorithm on a parallel machine with 256 processors. These processors are connected with an indirect interconnection network using state-of-the-art routers [3]. Based on these figures, there is obviously an uneven and inefficient use of system resources. During the two computational phases of the transpose, the network is idle. Conversely, when the network is actively transmitting messages, the processors are essentially idle. These characteristics are shared by many SPMD programs, including Accelerated Strategic Computing Initiative (ASCI) application codes such as Sweep3D [11]. Hence, there is tremendous potential for increasing resource utilization in a parallel machine.
Another important characteristic shared by many parallel programs is their access pattern to the network. The vast majority of parallel applications display bursty communication patterns with alternating spikes of impulsive communication with periods of inactivity [16]. Thus, there exists a significant amount of unused network bandwidth which could be used for other purposes.
Strobing
The uneven resource utilization and the periodic, bursty communication patterns generated by many parallel applications can be exploited to perform a total exchange of information and a synchronization of processors at regular intervals with little additional cost. This provides the parallel machine with the capability of filling in communication holes generated by parallel applications.
In order to provide the above capability, we propose a strobing mechanism to support the scheduling of a set of parallel jobs which share a parallel machine. Let us assume that each parallel job runs on the entire set of p processors, i.e., jobs are time-sharing the whole machine. At a high level, the strobing mechanism performs an optimized total-exchange of control information which then triggers the downloading of any buffered packets into the network.
The strobe can be implemented by designating one of the processors as the master, which generates the "heartbeat" of the strobe. The generation of heartbeats is achieved by using a timeout mechanism which can be associated with the network interface card (NIC). This ensures that strobing incurs little CPU overhead, as most NICs can count down and send packets asynchronously. This is true for a wide range of NICs, ranging from simple 100-Mb/s Ethernet cards to more sophisticated cards such as Myrinet [3].
On reception of the heartbeat, each processor (excluding the master) is interrupted and downloads a broadcast heartbeat into the network. After downloading the heartbeat, the processor continues running the currently active job. (This ensures computation is overlapped with communication.) When p heartbeats arrive at a processor, the processor enters a strobing phase where its kernel downloads any buffered packets to the network. Each heartbeat contains information on which processes have packets ready for download and which processes are asleep waiting to upload a packet from a particular processor. This information is characterized on a per-process basis so that on reception of the heartbeat, every processor will know which processes have data heading for them and which processes on that processor they are from.
Figure 2 outlines how computation and communication can be scheduled over a generic processor. At the beginning of the heartbeat, $t_0$, the kernel downloads control packets for the total exchange of information. During the execution of the barrier synchronization, the user process then regains control of the processor; and at the end of it, the kernel schedules the pending communication accumulated before $t_0$ to be delivered in the current time-slice. At $t_1$, the processor will know the number of incoming packets that it is going to receive in the communication time-slice as well as the sources of the packets and will start the downloading of outgoing packets.
This strategy can be easily extended to deal with space-sharing where different regions run different sets of programs [6, 12, 21]. In this case, all regions are synchronized by the same heartbeat.
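The strobing phase described above can be sketched as a small, single-threaded Python model. It is only an illustration under stated assumptions: the `Node` class, the `strobe` function, and the choice to let every node (including the master) contribute one heartbeat are hypothetical, and real heartbeats would be sent by the NIC rather than by a loop.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One processor taking part in the strobe (hypothetical model)."""
    rank: int
    p: int                                          # number of processors
    outbox: list = field(default_factory=list)      # packets buffered by the kernel
    heartbeats_seen: int = 0
    announced: dict = field(default_factory=dict)   # sender rank -> packets headed here

    def buffer_send(self, dest, payload):
        # Communication issued by the user process is buffered, not sent.
        self.outbox.append((dest, payload))

    def make_heartbeat(self):
        # Control information: number of buffered packets per destination.
        counts = {}
        for dest, _ in self.outbox:
            counts[dest] = counts.get(dest, 0) + 1
        return (self.rank, counts)

    def receive_heartbeat(self, heartbeat):
        sender, counts = heartbeat
        self.heartbeats_seen += 1
        if counts.get(self.rank, 0):
            self.announced[sender] = counts[self.rank]
        # True once all p heartbeats have arrived at this node.
        return self.heartbeats_seen == self.p


def strobe(nodes):
    """Run one strobe: total exchange of heartbeats, then flush the buffers."""
    for n in nodes:
        n.announced.clear()
        n.heartbeats_seen = 0
    heartbeats = [n.make_heartbeat() for n in nodes]
    for n in nodes:
        ready = False
        for hb in heartbeats:
            ready = n.receive_heartbeat(hb)
        if not ready:
            raise RuntimeError("node %d did not see all heartbeats" % n.rank)
    # Strobing phase: every kernel downloads its buffered packets.
    delivered = {n.rank: [] for n in nodes}
    for n in nodes:
        for dest, payload in n.outbox:
            delivered[dest].append((n.rank, payload))
        n.outbox.clear()
    return delivered
```

After the exchange, `announced` on each node holds the number of incoming packets and their sources, which corresponds to the knowledge a processor has at $t_1$ in the description above.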
The total exchange of information can be properly optimized by exploiting the low-level features of the interconnection network. For example, if control packets are given higher priority than background traffic at the sending and receiving endpoints, they can be delivered with predictable network latency during the execution of a direct total-exchange algorithm (Figure 3). We generated this distribution using a network of 256 processing nodes equipped with wormhole routers similar to those in the SGI Origin 2000 and assumed the existence of random background traffic that occupies 80% of the network capacity. If control packets are prioritized at the network endpoints, they can be delivered with a bounded latency of 30 μs.
We also analyzed the execution time of the direct total-exchange algorithm in a family of indirect networks with up to 1024 processing nodes. In this experiment, whose results are shown in Figure 4, we assume the existence of background traffic that varies from 20% to 80% of the network capacity. We can see that the execution time is largely insensitive to the intensity of the background traffic. With 64 processing nodes (the configuration of a single SGI Origin 2000 cluster) the execution time is only 50 μs, and this increases to 150 μs with 256 nodes. Due to the quadratic increase in the number of messages sent during the total exchange, the execution time reaches 1 ms with 1024 nodes, limiting the scalability of the approach. This scalability problem can be addressed in a clustered architecture like ASCI Blue Mountain by using a multiphase, indirect algorithm. In the first phase, we perform a total exchange within each cluster. Next, we do a total exchange between clusters. Finally, we conclude with a final phase internal to the clusters, giving a barrier synchronization time of less than 300 μs.
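The quadratic growth in message count, and the saving offered by a multiphase scheme, can be illustrated with a short count. This is a rough sketch, not the algorithm evaluated above: the function names are hypothetical, and the inter-cluster phase is assumed to use one representative per cluster, which the text does not specify. Message count is also only a proxy for execution time.

```python
def direct_messages(p):
    """Messages in a direct total exchange: each of p nodes sends to the other p - 1."""
    return p * (p - 1)


def multiphase_messages(clusters, cluster_size):
    """Messages in a three-phase exchange over equal-sized clusters.

    Assumption: the inter-cluster phase is a total exchange among one
    representative node per cluster.
    """
    intra = clusters * direct_messages(cluster_size)   # one intra-cluster phase
    inter = direct_messages(clusters)                  # exchange between clusters
    return intra + inter + intra                       # phases 1, 2 and 3


if __name__ == "__main__":
    print(direct_messages(1024))          # direct exchange on 1024 nodes
    print(multiphase_messages(16, 64))    # 16 clusters of 64 nodes
```

Under these assumptions, 1024 nodes organized as 16 clusters of 64 exchange roughly an eighth of the messages required by the direct algorithm.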
The global knowledge of the communication pattern provided by the total exchange allows for the implementation of efficient flow-control strategies. For example, it is possible to avoid congestion inside the network by carefully scheduling the communication pattern and limiting the negative effects of hot spots by damping the maximum amount of information addressed to each processor during a timeslice. The same information can be used at the kernel level to provide fault-tolerant communication. For example, the knowledge of the number of incoming packets greatly simplifies the implementation of receiver-initiated recovery protocols.
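As a minimal sketch of the damping idea, the following function caps the number of bytes addressed to each destination within one time-slice and defers the remainder to a later slice. The function name, the packet representation as `(dest, nbytes)` pairs, and the policy of deferring every later packet to a destination once one has been deferred (to preserve ordering) are assumptions made for illustration.

```python
def schedule_timeslice(pending, cap_bytes):
    """Split pending packets into those sent now and those deferred.

    pending   -- list of (dest, nbytes) in issue order, gathered machine-wide
    cap_bytes -- maximum bytes any destination may receive in this time-slice
    """
    sent, deferred = [], []
    load = {}          # bytes already scheduled per destination
    blocked = set()    # destinations that have reached their cap
    for dest, nbytes in pending:
        if dest not in blocked and load.get(dest, 0) + nbytes <= cap_bytes:
            load[dest] = load.get(dest, 0) + nbytes
            sent.append((dest, nbytes))
        else:
            # Keep per-destination order: once one packet is deferred,
            # later packets to the same destination wait as well.
            blocked.add(dest)
            deferred.append((dest, nbytes))
    return sent, deferred
```

The deferred list would simply be placed at the front of the pending list for the next time-slice, so a hot spot is spread over several slices instead of congesting the network in one.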
Blocking vs. Non-Blocking
One of the most limiting constraints in the implementation of time-sharing algorithms is the need to schedule simultaneously communicating processes. This problem is exacerbated with blocking communication, which imposes an explicit handshake between sender and receiver.
We argue that this problem can be eliminated, or at least alleviated, by slightly modifying the communication structure of parallel jobs and replacing blocking communication with non-blocking primitives and/or one-sided communication. Let us consider the following example. The dynamics of a message-passing program can be represented as a two-dimensional graph with processes on the horizontal axis and time on the vertical one, as shown in Figure 5. Arrows between processes represent communication between a sender and a receiver. In Figure 5(a), three processes exchange messages. For the sake of convenience, let us assume that there is no dependency between the messages (i.e., they can be sent in any order). Using a traditional, blocking, message-passing programming style, we must define a communication schedule even if one is not required, e.g., A sends to B, B receives from A and sends to C, C receives from B and sends to A.
With one-sided communication (or non-blocking communication primitives, in general), the actual message transmission and the synchronization are decoupled, leaving many degrees of freedom to re-arrange message transmission. In Figure 5 (b), the same communication pattern is delimited by two barriers which include the communication executed with put primitives. The communication can be executed in any order, provided that the information is delivered at the end of the synchronization calls. Also, communicating processes do not need to be simultaneously scheduled to perform the communication.
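The decoupling of transmission from synchronization can be illustrated with a toy model of the three-process example. The `Epoch` class and its `put` and `barrier` methods are hypothetical names, not part of MPI-2 or of the run-time system described here; the closing barrier delivers the buffered puts in a shuffled order to show that the outcome does not depend on it.

```python
import random


class Epoch:
    """Communication epoch delimited by two barriers (hypothetical model)."""

    def __init__(self, processes):
        self.memory = {name: {} for name in processes}   # per-process memory
        self.pending = []                                # buffered puts

    def put(self, src, dest, key, value):
        # Non-blocking and one-sided: record the transfer and return at once.
        self.pending.append((src, dest, key, value))

    def barrier(self, rng=None):
        # Deliver all buffered puts; the order is deliberately arbitrary.
        transfers = list(self.pending)
        if rng is not None:
            rng.shuffle(transfers)
        for _src, dest, key, value in transfers:
            self.memory[dest][key] = value
        self.pending.clear()


def run_example(seed):
    epoch = Epoch(["A", "B", "C"])
    epoch.put("A", "B", "from_A", 1)
    epoch.put("B", "C", "from_B", 2)
    epoch.put("C", "A", "from_C", 3)
    epoch.barrier(random.Random(seed))
    return epoch.memory
```

Because no receiver has to be running when a `put` is issued, the three processes need not be scheduled at the same time; only the barrier requires coordination.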
Bulk-Synchronous Parallel Programs
Using our proposed strobing and buffering mechanisms, any generic parallel program can be transformed into a Bulk-Synchronous Parallel (BSP) one [19] . Although the buffering and strobing mechanisms alone improve parallel program performance, transforming a parallel program into a BSP one not only can improve performance further but also allows for accurate prediction of the execution times.
A BSP computation consists of a sequence of parallel supersteps. During a superstep, each processor can perform a number of computation steps on values held locally at the beginning of the superstep and can issue various remote read and write requests that are buffered and delivered at the end of the superstep. This implies that communication is clearly separated from synchronization, i.e. it can be performed in any order, provided that the information is delivered at the beginning of the following superstep. However, while the supersteps in the original BSP model can be variable in length, our programming model generates computation and communication slots which are fixed in length and are determined by the time-slice.
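The superstep semantics can be sketched with a parallel prefix sum in which every processor computes only on the value it held at the start of the superstep, and remote writes become visible only at the start of the next one. The `superstep` and `prefix_sum` functions are illustrative and sequentially simulate the processors; they do not model the fixed-length time-slices of the proposed methodology.

```python
def superstep(state, compute):
    """Run one BSP superstep over a list of per-processor local values.

    compute(rank, local) returns (new_local, writes), where writes is a
    list of (dest, value) pairs that are buffered until the superstep ends.
    """
    new_state, outboxes = [], []
    for rank, local in enumerate(state):
        new_local, writes = compute(rank, local)
        new_state.append(new_local)
        outboxes.append(writes)
    # End of superstep: deliver every buffered write.
    inboxes = [[] for _ in state]
    for writes in outboxes:
        for dest, value in writes:
            inboxes[dest].append(value)
    return new_state, inboxes


def prefix_sum(values):
    """Inclusive prefix sum in about log2(p) supersteps."""
    p = len(values)
    state = list(values)
    step = 1
    while step < p:
        def compute(rank, local, step=step):
            # Send the value held at the start of the superstep.
            writes = [(rank + step, local)] if rank + step < p else []
            return local, writes
        state, inboxes = superstep(state, compute)
        # Received values are used at the beginning of the next superstep.
        state = [local + sum(inbox) for local, inbox in zip(state, inboxes)]
        step *= 2
    return state
```

For example, `prefix_sum([1, 2, 3, 4])` evaluates to `[1, 3, 6, 10]` after two supersteps.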
One important benefit of the BSP model is the ability to accurately predict the execution time requirements of parallel algorithms and programs. This is achieved by constructing analytical formulae that are parameterized by a few constants which capture the computation, communication, and synchronization performance of a p-processor system. These results are based on the experimental evidence that the generic collective communication pattern generated by a superstep, called an h-relation, can be routed with predictable time [9, 17]. This implies that the maximum amount of information sent or received by each processor during a communication time-slice can be statically determined and enforced at run time by a global communication scheduling algorithm. For example, if the duration of the time-slice is $\delta$ and the permeability of the network (i.e., the inverse of the aggregate network bandwidth) is $g$, the upper bound $h_{max}$ of information, expressed in bytes, that can be sent or received by a single processor is $h_{max} = \delta / g$. Furthermore, by globally scheduling a communication pattern, as described in Section 3.2, we can derive an accurate estimate of the communication time with simple analytical models already developed for the BSP model [4].
Another important benefit of the BSP model is higher resource utilization over the parallel machine, irrespective of the computational and communication patterns. [...] processor receives $h_{max}$ bytes) or a denser communication pattern (where more processors share the same upper bound) can be routed in the same communication time-slice. This means that it is possible to use spare communication bandwidth to deliver packets generated by other parallel jobs without detrimental effects. More generally, as with any multiprogrammed system, multitasking a collection of bad (parallel) programs, i.e., programs with unbalanced computation or communication, may produce the same behavior as a single well-behaved (parallel) program.
Multitasking can provide opportunities for filling in "spare communication cycles" by merging sparse communication patterns together to produce a denser communication pattern.
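The bound on per-processor traffic and the merging of sparse patterns can be made concrete with a small worked example. The sketch assumes the bound is the time-slice duration divided by the permeability g (time per byte), and uses 8 ns per byte, i.e., the 1 Gb/s link rate of the simulated network, as a stand-in for g; the function names and the `(src, dest, nbytes)` pattern format are hypothetical.

```python
def h_max_bytes(timeslice_ns, g_ns_per_byte):
    """Upper bound on bytes one processor may send or receive per time-slice."""
    return timeslice_ns // g_ns_per_byte


def h_of(pattern, p):
    """Size h of the h-relation formed by a list of (src, dest, nbytes)."""
    sent = [0] * p
    received = [0] * p
    for src, dest, nbytes in pattern:
        sent[src] += nbytes
        received[dest] += nbytes
    return max(sent + received, default=0)


def fits(pattern, p, bound):
    """True if the pattern can be routed within one communication time-slice."""
    return h_of(pattern, p) <= bound


if __name__ == "__main__":
    bound = h_max_bytes(500_000, 8)          # 500 us slice, 8 ns per byte
    job_a = [(0, 1, 40_000)]                 # sparse pattern of one job
    job_b = [(2, 3, 50_000), (1, 0, 20_000)] # sparse pattern of another job
    print(bound, fits(job_a + job_b, 4, bound))
```

With these illustrative numbers the bound is 62,500 bytes, and the two sparse patterns merge into a single pattern whose h is 50,000 bytes, so both jobs can share the same communication time-slice.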
Lastly, the BSP model is also beneficial for fault tolerance. Fault tolerance can be naturally implemented by checkpointing the machine at the synchronization points at the end of a time-slice.
Experimental Results
Our preliminary results include a working implementation of a representative subset of MPI-2 on a detailed (register-level) simulation model [18] . The simulation environment includes a standard version of MPI-2 and a multitasking one that implements the main features of our proposed methodology.
Characteristics of the Synthetic Workloads
As in [5] , the workloads used consist of a collection of single-program multiple-data (SPMD) parallel jobs that alternate phases of purely local computation with phases of interprocess communication. A parallel job consists of a group of P processes where each process is mapped onto a processor throughout its execution. Processes compute locally for a time uniformly selected in the interval 
The Simulation Model
The simulation tool that we used in our experimental evaluation is called SMART (Simulator of Massive ARchitectures and Topologies) [18] , a flexible tool designed to model the fundamental characteristics of a massively parallel architecture.
The current version of SMART is based on the x86 instruction set. The architectural design of the processing nodes is inspired by the Pentium II family of processors. In particular, it models a two-level cache hierarchy with a write-back L1 policy and non-blocking caches.
For our experiments, we assume a network of 32 processors, each running at 500 MHz, interconnected in a 5-dimensional cube topology with performance characteristics similar to those of Myrinet routing and network cards [3]. This network features a one-way data rate of about 1 Gb/s and a base network latency of a few μs.
The run-time support running on this simulated platform includes a standard version of a substantive subset of MPI-2 and a multitasking version of the same subset that performs the strobing algorithm at the end of each time-slice as outlined in Section 3. It is worth noting that the multitasking MPI-2 version is actually much simpler than the sequential one because the buffering of the communication primitives greatly simplifies the run-time support.
Sensitivity Analysis
Figures 6 and 7 illustrate the communication and computation characteristics of our synthetic benchmarks as a function of the communication pattern, granularity, load imbalance, time-slice duration, and context-switch penalty. Each bar shows the percentage of time spent in one of the following states, averaged over all processors: computing, context-switching and idling.
For each communication pattern, we analyze the Cartesian product of nine alternatives generated by considering time-slices of 500 μs, 1 ms and 2 ms with context-switch penalties of 50, 100, and 200 μs. For each alternative, we reduce the computational grain size g, going from left to right, from 50 ms down to 100 μs and consider "six groups of three bars" of experiments. Each group has the same computational granularity, and the load imbalance is increased as a function of the granularity itself. We consider three cases: v = 0 (i.e., no variance), v = g (in this case the variance is equal to the computational granularity) and v = 2g (high degree of imbalance).
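The experimental grid can be enumerated with a few lines of Python. The time-slice and context-switch values are those listed above; the intermediate grain sizes between 50 ms and 100 μs are not stated in the text, so the sketch takes the grain sizes as a parameter rather than guessing them. All names are illustrative.

```python
from itertools import product

TIME_SLICES_US = (500, 1000, 2000)       # 500 us, 1 ms, 2 ms
CONTEXT_SWITCH_US = (50, 100, 200)
VARIANCE_FACTORS = (0, 1, 2)             # v = 0, v = g, v = 2g


def alternatives():
    """The nine (time-slice, context-switch penalty) combinations."""
    return list(product(TIME_SLICES_US, CONTEXT_SWITCH_US))


def bars(grain_sizes_us):
    """One (grain, variance) pair per bar, grouped by grain size."""
    return [(g, k * g) for g in grain_sizes_us for k in VARIANCE_FACTORS]
```

With six grain sizes, `bars` yields the eighteen bars of each plot, and combining them with the nine alternatives gives 162 configurations per communication pattern.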
At the bottom of each figure we also report the breakdown for the same communication pattern when the workload is run in dedicated mode with standard MPI-2 run-time support (i.e., a single job is run until completion without multitasking). A black square under a bar highlights the configurations where the multitasking approach produces better resource utilization than the standard approach.
Based on Figures 6 and 7, we make the following observations. First, the performance of buffered coscheduling is sensitive to the context-switch latency. As context-switch latency decreases, resource utilization and throughput improve. Second, as the load imbalance of a program increases, the idle time increases. Third, and most importantly, these initial results indicate that the time-slice length is a critical parameter in determining overall performance. A short time-slice can achieve excellent load balancing even in the presence of highly unbalanced jobs. The downside is that it amplifies the context-switch latency. On the other hand, a long time-slice can virtually hide all the context-switch latency, but it cannot reduce the load imbalance, particularly in the presence of fine-grained computation.
More specifically, Figure 6(g) shows that a relatively small time-slice coupled with a small context-switch latency generally results in better processor utilization than a single job running in a dedicated environment (Figure 6(l)) in eleven cases out of eighteen. Running a single job provides only slightly better (less than 10%) performance with perfect load balancing (v = 0) because we have to pay the context-switch penalty without improving the load balance. On the other hand, in the presence of load imbalance, job multitasking can smooth out the differences in the load.
As a rule of thumb, buffered coscheduling performs admirably as long as the average computational grain size is larger than the time-slice, and the time-slice in turn is sufficiently larger than the context-switch penalty. In addition, when the average computational grain size is larger than the time-slice, the processor utilization is mainly influenced by the degree of imbalance.
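The rule of thumb can be written down as a simple predicate. This is only a sketch: the text does not quantify "sufficiently larger", so the `margin` factor of 10 between time-slice and context-switch penalty is an assumption, as are the function names.

```python
def context_switch_overhead(timeslice_us, context_switch_us):
    """Fraction of each time-slice spent on the context-switch penalty."""
    return context_switch_us / timeslice_us


def favourable(grain_us, timeslice_us, context_switch_us, margin=10):
    """True when a configuration satisfies the rule of thumb.

    margin is an assumed threshold for "sufficiently larger".
    """
    grain_ok = grain_us > timeslice_us
    slice_ok = timeslice_us >= margin * context_switch_us
    return grain_ok and slice_ok
```

For instance, a 50 ms grain with a 500 μs time-slice and a 50 μs penalty satisfies the predicate, whereas a 100 μs grain with the same time-slice does not.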
In these initial experimental results, we did not take into account the effects of the memory hierarchy on the working sets of different jobs. As a consequence, buffered coscheduling requires a larger main memory in order to avoid memory swapping. We consider this the main limitation of our approach.
Conclusion
In this paper we have presented buffered coscheduling, a new methodology to multitask parallel jobs on a parallel computer. Buffered coscheduling represents a significant improvement over existing work reported in the literature. It allows for the implementation of a global scheduling policy, as done in explicit coscheduling, while maintaining the overlapping of computation and communication provided by implicit coscheduling.
We initially addressed the complexity of a huge design space using two families of synthetic workloads. The preliminary experimental results reported in this paper show that our methodology can provide better resource utilization, particularly in the presence of load imbalance and communication-intensive jobs.
We plan to extend these preliminary results by considering the effects of the memory hierarchy in a real application rather than in synthetic workloads and to implement a multitasking version of MPI-2 in a Linux cluster. 
