Automatic scheduling for directed acyclic graphs (DAG) and its applications for coarse-grained irregular problems such as large n-body simulation have been studied in the literature. However solving irregular problems with mixed granularities such as sparse matrix factorization is challenging since it requires e cient run-time support to execute a DAG schedule. In this paper, we investigate run-time optimization techniques for executing general asynchronous DAG schedules on distributed memory machines and discuss an approach for exploiting parallelism from commuting operations in the DAG model. Our solution tightly integrates the run-time scheme with a fast communication mechanism to eliminate unnecessary overhead in message bu ering and copying. We present a consistency model incorporating the above optimizations, and taking advantage of task dependence properties to ensure the correctness of execution. We demonstrate the applications of this scheme in sparse matrix factorizations and triangular equation solving for which actual speedups are hard to obtain. We provide a detailed experimental study on Meiko CS-2 to show that the automatically scheduled code has achieved good performance for these di cult problems and the run-time overhead is small compared to total execution times.
Introduction
Solving irregular problems, where patterns of computations and communications are unstructured and/or changing dynamically, is very important to many scienti c applications. The automatic parallelization of such problems on distributed memory machines is extremely di cult and presents a great challenge. Automatic scheduling and load balancing techniques are useful in exploiting irregular parallelism in unstructured computations 12, 16, 27] . For iterative irregular problems in which communication and computation phases alternate, the CHAOS system 8] has used the inspector/executor approach to exploit irregular parallelism at each iteration of the computation phase. The problems addressed in CHAOS have no loop-carried dependency at the computation phase and processors can run independently before next communication phase. In a more complicated iterative application, the computation phase may involve task parallelism with irregular dependence structures. For example, in the adaptive n-body simulation using the fast multipole method, parallelism at each time step can be modeled as a directed acyclic task graph (DAG) 12]. The DAG schedule can be re-used for a number of iterations before re-scheduling since particle movement is slow. In solving nonlinear equations using iterative methods, sparse matrix factorization usually dominates the computation time at each iteration step and the topology of the iteration matrix remains the same but the numerical values change at each step 15] . E cient parallelization of sparse factorization requires certain compilation cost, but the optimized solution can be used for many iterations. Sparse matrix 1 factorizations can be modeled as DAGs 23] . Compared to large n-body simulations, e cient execution of DAG schedules for sparse matrix problems is more challenging because partitioned graphs contain both coarse and ne grained tasks.
Most of previous work in DAG scheduling has mainly focused on the algorithmic research for task mapping, e.g. 21, 28] and little research has been conducted on e cient run-time support for executing task schedules. The early work in the PYRROS system 29, 30] provides a complete framework for general task computation; however, its run-time support system uses the NX/2 level communication interface and overhead for message management is high, which prevents PYRROS from obtaining good performance for sparse matrix problems.
We have developed an e cient run-time support system called RAPID for executing general DAG computations with mixed granularities. The distinguishing feature of our work is that we tightly incorporate a fast communication mechanism into our run-time scheme. Such tight integration allows us to eliminate unnecessary cost in managing message bu ers and copying data. The experiments indicate that the overhead is only 6 ? 10% of the execution time for the tested applications. A pitfall in such a scheme is that processors may obtain incorrect data copies without a proper message synchronization. We carefully design the task-level communication and consistency model, and ensure the correctness of execution by taking advantage of properties of task graphs.
We demonstrate the applications of our techniques for sparse matrix computations. It has been very di cult to obtain actual speedups by hand-made code for these problems, not to mention automatically parallelized code, especially on distributed memory machines. Recently impressive performance is obtained for sparse Cholesky factorization 19, 20] and sparse triangular solver 6]. We discuss how we can exploit task parallelism in sparse Cholesky factorization and LU without pivoting using our techniques in a general runtime support environment. One di culty in parallelizing sparse Cholesky factorization is that parallelism also arises from operations that commute. We discuss how to address this issue in our task computation model and present experiments to show that our approach delivers decent performance. As far as we know, this is the best performance achieved by automatically scheduled codes. To demonstrate the applicability of our approach for other sparse matrix problems, we also examine the performance of our run-time techniques for sparse lower triangular solvers and study the impact of granularities on the overall performance.
The paper is organized as follows. Section 2 describes the task computation model. Section 3 identi es critical issues in developing an e cient run-time system. Section 4 presents the design of our system in details. Section 5 gives the correctness analysis of our method. Section 6 presents an experimental study on several sparse matrix computations and provides an analysis on overhead and granularity issues. Section 7 summarizes the related work. Section 8 concludes the paper.
2 The task computation model
Task dependence graphs
Our work is targeted at message passing distributed memory architectures. We use the directed acyclic task graphs (DAGs) as the computation model. We need to study the properties of task dependence so that we can take advantage of those properties in designing the run-time task execution model. Classical data dependence graphs (DDG) for modeling the interaction between computation statements have three 2 types of dependencies: true, anti and output. Task graphs di er from DDG in that they only have true data dependence edges. In a DDG, many of anti or output dependence edges could be redundant if they are subsumed by other true data dependence edges. With the presence of anti and output dependence, synchronization in run-time support becomes more complicated in order to ensure the correctness of execution. In functional languages, the \single assignment" principle has been used in data ow computation and functional languages 17, 21], i.e., each data item can be written at most once, thus there is neither output nor anti dependence. The renaming techniques 7] can eliminate the output and anti dependence in transforming a DDG to a \single assignment" graph. The advantage of such a principle is that dependence enforcement in task execution can be simpli ed to some degree. The disadvantage is that memory usage and data copying overhead increase accordingly.
The task graph model we use is less restrictive than \single assignment" functional data ow graphs in the sense that data items can be modi ed more than once. Such a model more naturally ts into many scienti c problems. For example, linear algebra algorithms usually update matrix elements gradually. Thus unnecessary data renaming and copying can be avoided when we transform a DDG to a task graph.
In transforming the DDG of a program to a task graph, we rst delete all redundant output and anti dependence edges. If the resulting graph contains true data dependence only, then this graph is a task graph. Otherwise renaming of data items is necessary to remove the remaining anti and output dependence.
Informally we call a task graph dependence-complete for a given program if all anti and output dependence relations among tasks of this program are subsumed by edges or paths in this task graph and there exists a sequential program from which this task graph can be derived. Figure 1 depicts a DDG from which a dependence-compete task graph can be derived by deleting all anti and output dependence edges. Formally a task graph G is dependence-complete if it satis es the following four properties: DA1: A task only uses distinct data items. A task T x does not receive data items with the same ID from di erent predecessors. DA2: Write-read and read-write data dependence. If task T x produces data item m and another task T y reads data item m, then there must be either a path from T x to T y or a path from T y to T x in G.
DA3: Write-write data dependence. If task T x produces data item m, another task T y also produces data item m, there must be either a path from T x to T y or a path from T y to T x in G. 3 DA4: A DAG is sequentializable with respect to a topological ordering of tasks. An execution following any dependence order imposed by edges in G should have the same semantics of a sequential program.
These properties DA1 ? 4 impose natural constraints that a task graph should have in representing task computation. These constraints provide a basis for us to design highly e cient execution scheme, thus to exploit parallelism residing in the graph as much as possible. This will become clearer as we discuss in the subsequent sections. More discussion on the above properties and their relationship can be found in 29] . The transformation to convert a DDG to a dependence complete task graph is described in 9]. Most task graphs derived from real scienti c applications satisfy these properties without renaming. For example, sparse LU graphs discussed in Section 6.1.2 are dependence-complete.
Handling commutativity
One disadvantage of a dependence-complete task graph (or DDG) is that it cannot model parallelism arising from commuting operations. For example, in sparse Cholesky factorization all the updating tasks for the same block commute. And this is a potential source of parallelism 1 .
We de ne our commutativity model as follows. For a given data item m, there are a set of commuting tasks CT 1 m , CT 2 m , , CT n m in modifying data item m, and a reduction task RT m uses the result of these n tasks.
This set of commuting tasks can be executed in an arbitrary order. Therefore, depending on whether we take this characteristic into consideration or not, we can have two di erent versions of task graphs. One is based on pure sequential semantics, and the other is based on the commutativity. Figure 2 (a) shows a dependency chain derived from a sequential code, and Figure 2 (b) shows the dependency structure re ecting the commutativity. We call this type of task graphs commutative task graphs. Apparently scheduling a commutative task graph would gain more performance because of more exibility in scheduling. However while the task graph derived from sequential semantics satis es DA1-4, the task graph considering commutativity is not dependence-complete since all commuting tasks (e.g., in Figure 2 (b)) modify the same data item which violates assumptions DA1 and DA3. In order to correctly execute a commutative task graph, we have to alter the task graph after the scheduling. This is done by renaming data items and changing edges as depicted in Figure 3 . We explain the procedure as follows. For each reduction task 1 This is called fan-in parallelism in the literature 14]. Fan-out parallelism in Cholesky factorization is naturally included in the task graph model as submatrix multi-casting edges. 4 RT m in a commutative task graph, we apply the following transformations for its commuting predecessors CT i m 's in each processor P x . 1) We delete all edges from these commuting tasks CT i m 's to the reduction task RT m and create a new data item m x for these commuting tasks in processor P x ; 2) Add a chain of dependence edges between these commuting tasks following the ordering at P x in the DAG schedule and label each edge using the newly-created data item. 3) Add a dependence edge from the last predecessor commuting task on P x to task RT m and also label this edge using the data item m x . It is easy to verify that the transformed graph is dependence-complete. 3 Issues in run-time support of executing task graphs
Techniques involved for executing a task graph schedule could be very simple. Essentially, each processor follows the task execution order, issues a receiving operation for each message a task needs from its predecessors and sends data items produced by this task to its successors. However without a carefully designed communication scheme, the execution e ciency could be very poor. We discuss the major performance issues as follows.
First of all, sending and receiving overheads should be low enough compared to the amount of computing work each task does. Usually in order to support asynchronous communication, a message passing system such as Intel NX/2 uses system bu er space to manage incoming and outgoing messages. It is well known that message-bu ering imposes a signi cant amount of overheads for copying and space management on source and destination processors 26].
Secondly, e ort is required to maintain data dependencies among tasks. Each task has to receive the correct copies of the desired data items. Figure 4 shows a situation in which a task could receive a wrong copy of the desired data item. Task T w should receive the copy of data m from task T z , not the copy of m from task T x . There should be a way to distinguish di erent copies of the same data item.
Thirdly, due to the nonblocking nature of asynchronous communication, returning from a sending operation does not guarantee the message is actually copied or sent out. Thus a proper mechanism is needed to ensure that the message will not be overwritten before it is copied or sent out. For example in Figure 4 , a task T x on processor j sends a data item m to processor k. Another task T h will update data m as well. It is necessary to make sure that task T h will not be executed until the data m is sent out.
Finally, a task may send the same message to several successors and some of these successors could be assigned to the same processor. In this case, it is enough to send this message once to the processor on which those successors reside. Optimizations have to be performed to eliminate redundant messages. Correspondingly, in the destination only one receiving is needed to pull out data from the network. Subsequent receiving operations for this data item should be re-directed to read from the local memory instead.
PYRROS 30] uses a bu ered message-passing mechanism and aggregates communication as much as possible. The overhead is tolerable with medium grain computation, but the lesson learned from the PYRROS project is that these techniques are not su cient to make sending/receiving overhead low enough to e ciently execute task schedules arising from sparse matrix computation with mixed granularities. In the next section, we will show how we can address all these issues in a single run-time support framework by carefully designing the message-passing layer, and the message/data management strategy.
Design of a run-time supporting system
The key ideas of our approach are to integrate the low level communication primitives with carefully designed data space management and ensure the consistency by using the properties of dependence-complete task graphs. We discuss aspects of this system in details as follows.
Remote memory access
Since the existence of system bu ers causes a signi cant amount of overheads, in order to achieve low latency communication, our approach uses Remote Memory Access(RMA) as our basic message passing method. RMA can be implemented in modern multi-processor architectures such as Cray-T3D (SHMEM) and Meiko CS-2 (DMA). With RMA, a processor can write to memory of any other processor if a remote address is given. RMA allows passing data directly from one source location to a destination location, without any copying, packing/unpacking and bu ering. The model of RMA is de ned as follows. When a program in a processor needs to write a data item to another processor, it only needs to provide a remote service descriptor which speci es the source/destination addresses, length of data to be transferred and some other control information. This can be done with a few instructions. After dispatching the RMA descriptor, the data will be transferred asynchronously according to the description. A restriction is that the data transferred through a RMA cannot be rewritten until the RMA completes, otherwise the remote copy of data will become inconsistent. We need to make sure such inconsistency does not happen in the 6 asynchronous execution of tasks.
RMA imposes low overhead on both source and destination processors. We have implemented our system on Meiko CS-2 which provides Direct Memory Access(DMA) as the major way to access non-local memory. Each node of Meiko CS-2 is also equipped with a DMA co-processor for handling communication. It takes the main processor 9 microseconds 22] to dispatch the remote memory access descriptor to the co-processor. The co-processor afterwards will be responsible for sending data without interrupting the main processor so that the opportunity of overlapping communication and computation is maximized. It should be noted that the Meiko does not provide DMA support for multi-casting a data item. Thus we have to implement the multi-casting by repeatedly using the DMA feature. For irregular computation where broadcasting happens less frequently than multi-casting, such design is reasonable.
Maintaining data items and addresses
In order for each processor to initialize data items, we assign an owner processor to each data item. There could be several choices for the owner of each data item: the processor that rst modi es the data item; the processor that uses the data item most frequently; or the last processor that modi es the data item. We adopt the last strategy because it is easy to collect the nal results. Each processor will use some data items other than those it owns and a static data ow analysis is performed to determine a set of data items each processor will use.
RMA requires remote addresses for each processor. If a processor wants to receive a data item from other processors, it has to let all the senders know its address of that data item in advance. For this purpose, on each processor we maintain a hash table which contains addresses of all data items that this processor needs. And the memory space will be allocated in advance for all these data items. The goal of this paper is to evaluate how e cient a run-time system can be and the memory constraint is not considered.
Sending data items
After a task completes its computation, it needs to send the de ned or updated data items to its successors. If two or more successors which receive the same message are in the same processor, then this processor only needs to receive one copy of the message. The sending procedure is shown in Figure 5 (a). When a task tries to send a data item m to another processor, it obtains the remote address of this item m from the local hash table. After that a RMA descriptor is composed and a RMA transfer is initiated. Notice that multiple RMAs can be initiated consecutively without waiting for the completion of previous RMA transfers. After the processor issues all the RMAs needed for the current task, it can continue to execute the next task. Since the startup overhead of initiating a RMA is relatively small, computation can be overlapped with communication to a maximum degree. When this data item arrives at the destination processor, a ag in the destination associated with that data item will be set to indicate its arrival.
Receiving data items and preserving data dependencies
Data consistency is the major concern when a task receives data items. Two issues need to be addressed. Firstly, how to decide whether a data item should be retrieved from the local memory or from the remote processor? Secondly, how can a processor nd out if a data item from the remote processor has arrived at the local memory.
In our scheme, each data item at processor p x is associated with a usage counter and a ag. The \counter" eld of a data item represents the number of tasks at processor p x that use this copy of the data item.
For the \ ag" eld, it indicates the data item has arrived at the local memory if it is set. The counter information for each copy of a data item at processor p x can be determined during a static data ow analysis. We call a data item dead if the corresponding usage counter is zero.
When task T y at processor p x needs to read data item m, it rst checks the usage counter of this data item. If the counter is greater than zero, the task decrements the counter and proceeds. If the data item is dead, the task has to wait until the associated ag is set, which indicates that a new copy of this data item arrives. The task will then clear the ag and set the new counter to be the number of times that this copy will be used on this processor. The receiving procedure is illustrated in Figure 5 (b). The values of the counter and the arrival ag represent the state of a data item. Figure 5 (c) shows how states evolve during the execution of a task graph. The edges are actions taken to transit from one state to another.
For example, in Figure 4 , when task T y rst receives data m from processor j, it will set the usage counter of m to 2 because another task T u in processor k also needs it. Then T y will decrement the counter. After T u uses it, the counter will be decremented to zero. Thus when task T w tries to receive m, it will nd the counter to be zero. Then it has to wait for the arrival of a new copy of data m which is from task T z in processor i. This time the usage counter is 1.
We also explain the reason of using two separate elds \counter" and \ ag" as follows. Suppose they are combined as one, then when data item m is sent from processor p x to p z via RMA, the counter must be set to a proper value before RMA starts and it cannot be changed until RMA completes. Since multi-casting of the same data item happens frequently in applications (e.g., in sparse Cholesky and LU factorizations) and the usage counters of this data item may di er for di erent destination processors. Hence sending data item to one processor cannot be overlapped with sending the same copy of the data item to another processor. 8 
Consistency analysis
In this section, we address the correctness of our asynchronous execution scheme. Two data inconsistencies could arise in the following situations: 1. Data inconsistency caused at receiving sites. Before a data item m on processor p is dead, a new copy of data m could come in and overwrite it. Thus the subsequent reading of data m will not get the correct copy. 2. Data inconsistency caused at sending sites. Since the initiation of a RMA request for a data item m does not guarantee that data m is actually sent out, another task, either at the same processor or at the di erent processor, might overwrite m before the RMA data transfer is completed. Then the destination processor will not have the correct copy it needs.
Fortunately, in the following theorems, we can show that if a task graph is dependence-complete, the above situations will never happen in our execution scheme.
Theorem 1 No receiver site inconsistency. For any dependence-complete graph, if a task T y de nes data m and sends it to task T x , then before T x reads it, no other task will overwrite it.
Proof: Suppose there is another task T z which sends a new copy of data m to the processor of T x before T x reads data m. There must be another task T w that receives data m from T z . If T x is the same as T w , then task T x would receive the same data from T y and T z , which is impossible according to DA1. Thus T x is di erent from T w , and T y and T z nish their execution before T x and T w . According to DA2, there are a dependence path from T y to T w and a path from T z to T x as well (shown in Figure 6 (a)). And DA3 implies that there must be a path between T y and T z . If the path is from T y to T z , then T x uses m from T y and T z rede nes m, which contradicts DA4 and the same is true if the path is from T z to T y . Theorem 2 No sender site inconsistency. For any dependence-complete task graph, if there is a task T x on processor p x that receives data item m from task T w on processor p w , then before the data item m is actually sent out, no other task will rede ne this item on processor p w , i.e., processor p x will receive the the correct copy of data item m.
Proof: Suppose there is another task T y that will rede ne data item m after task T w , then according to 9 DA2 and DA3, there must be a dependence path between T w and T y , and a path between T x and T y as well. As it is shown in Figure 6 (b), we have two cases.
Case1: Suppose there is a dependence path from T x to T y , then task T y will not start executing until the task T x nishes. So task T x will receive the correct version of data item m;
Case2: Suppose there is a dependence path from T y to T x , since T y will modify the data m, it is not sequentializable, which contradicts the DA4.
Experimental studies
We will examine the performance of our system on two types of irregular applications: sparse factorization (Cholesky and LU without pivoting) and sparse triangular solvers. We have developed a set of library functions 9] which allow users to specify data access patterns based on a sequential version of the application. We use the PYRROS scheduling algorithms to produce a task schedule and our run-time system will execute such an irregular schedule. We will analyze the overall performance and itemized cost distribution of the run-time execution.
All our experiments are conducted on the distributed memory machine Meiko CS-2. The single-node LINPACK performance we have tested is 8:04MFLOPS for double-precision oating operations 2 . The cost to dispatch a RMA request is 9 microseconds. The communication peak bandwidth of Meiko CS-2 is 40MBytes/Sec. The e ective peak bandwidth has been reported as 39MB/Sec, but this is obtained through a ping-pong test 22] and message caching improves the performance implicitly. In practice, a message is sent only once at a time. We nd that the e ective bandwidth of a single-message sending or direct memory access is in the range of 10{25 MB/Sec when the sizes of messages vary from 1K to 10K.
Since our experiments use matrices as messages, for small messages with sizes smaller than 1K we list the bandwidths by matrix sizes in Table 1 . Each matrix element is a double-precision number, which is 8 bytes long. In this test, we assume the whole message is pre-fetched into the cache. This assumption is justi able because each task usually sends out the submatrix it just updated. The individual message sizes in our experiments are all less than 6K.
Matrix Size 1x1 2x2 3x3 4x4 5x5 6x6 7x7 8x8 9x9 10x10 11x11 Bandwidth(MB/S) 0.09 0.38 0.83 1.50 2.28 3.22 4.14 5.52 6.55 8.10 9.73 Table 1 : Communication bandwidth for small matrices on Meiko CS-2.
In the rest of this section, we will report our experiments and examine the overall performance of the run-time system for executing task graphs arising from sparse Cholesky factorization, LU without pivoting and sparse triangular solvers. We will study di erent aspects of system performance such as the bene t of exploiting commutativity of operations and the overhead of run-time support.
Sparse matrix factorizations and run-time overhead analysis 6.1.1 Sparse Cholesky factorization
Sequential and parallel sparse Cholesky factorization algorithms have been studied extensively in literature. Rothberg and Schreiber 19, 20] show that the supernode-based approach can deliver good performance on both shared and distributed memory machines. Specialized scheduling techniques have been studied for sparse Cholesky factorization in 11]. We will examine the performance of applying general task scheduling and executing techniques to this problem.
For a sparse matrix, the BLAS-3 level Cholesky factorization algorithm 3 is shown in Figure 7 (a). The n n matrix is divided into N N submatrices and some submatrices are zero due to the sparsity. The task graph only operates on those nonzero submatrices. Given a sparse matrix, we rst use the multiple minimum degree algorithm to decide the ordering and then calculate ll-ins. For instance, Figure 7 (b) is a 21 21 sparse matrix with ll-ins. After that we use supernode partitioning 14] so that every basic task operation involves only dense matrix or vector operations which can be implemented using BLAS and LAPACK routines. We also partition supernodes into smaller blocks. And for those supernodes smaller than the block size used, they will remain unchanged. In this way we can take the advantage of caching and reduce indexing overhead. In the literature, tasks F kk and S ik are also called DIV operations. And tasks M k ij are called MOD operations. An important property is that all the MOD operations with respect to the same block commute. This is a potential source of parallelism as we discussed in Section 2.2. The reduction task here is the DIV operation. In Figure 8 we show two di erent Cholesky task graphs for the example matrix in Figure 7 (b) depending on whether we consider commutativity or not. From this small example it can be seen that exploiting operation commutativity does expose more parallelism.
We have tested a set of benchmark matrices from the Harwell-Boeing collection. BCSSTK15 arises from structural engineering analysis for Module of an O shore Platform, BCSSTK16 is for Corp. of Engineers Dam, BCSSTK17 is for Elevated Pressure Vessel, and BCSSTK24 is for Calgary Olympic Saddledome T (1) T (3) T (2) T (5) T(4) T (6) T (11) T (12) T (14) T (7) T (10) T (9) T (8) T (13) T (15) T (16) T (17) T (18) T (19) T (20) T (21) T (22) T (28) T (37) T (38) T (23) T (25) T (24) T (27) T (26) T(33) T (34) T (36) T (29) T(32) T (31) T (30) T (35) T (39) T (40) T (41) T (42) T (43) T (44) T (45) T (46) T (47) T (57) T (48) T (51) T (50) T (49) T (54) T (53) T (52) T (56) T (55) T(61) T (62) T (66) T (63) T (68) T (69) T (58) T (60) T (59) T (64) T (65) T (67) T (77) T (78) T (79) T (70) T(73) T (72) T (71) T(76) T (75) T (74) T(83) T (84) T (88) T (85) T (90) T (91) T (80) T (82) T (81) T (86) T (87) T (89) T (92) T (93) T (94) T (95 (2) T (5) T (4) T (6) T (17) T (18) T (45) T (7) T (10) T (9) T (8) T (13) T (12) T (11) T (15) T (14) T (16) T (19) T (46) T (95) T (21) T (20) T (22) T (23) T (25) T (24) T (27) T (26) T (28) T (39) T (40) T (29) T(32) T (31) T (30) T (35) T (34) T (33) T (37) T (36) T (38) T (41) T (43) T (42) T (44) T (47) T (48) T (51) T (50) T (49) T (54) T (53) T (52) T (56) T (55) T (57) T (64) T(65) T (66) T (92) T (93) T (58) T (60) T (59) T (62) T (61) T (63) T (68) T (67) T (69) T (70) T(73) T (72) T (71) T (76) T (75) T (74) T (78) T (77) T (79) T (86) T (87) T (88) T (80) T (82) T (81) T (84) T (83) T (85) T (90) Arena. Task graphs derived from these matrices are quite large and unstructured. We summarize some statistics of those task graphs in Table 2 . We have also implemented a sequential version of the sparse Cholesky factorization code to assess the speedups. Its performance on a single Meiko CS-2 node is also listed in Table 2 . The sparse matrix performance on sequential machines is usually poor because of the indexing of irregular data items. Our sparse Cholesky implementation has achieved 73% ? 92% of the dense matrix LINPACK performance. We think this is the best sequential implementation we can have 4 . All parallel speedups are compared directly to this sequential performance.
The parallel performance is shown in Figure 9 (a) and all the speedups are obtained with consideration of commutativity. In average, we have obtained speedup 7 on 16 processors and 11:3 on 32 processors. We will discuss the overhead of run-time execution in Section 6.1.3. The speedup on 32 processors is 11:33 for BCSSTK15 and processor utilization e ciency is 35:4%. In 20], e ciency 25% has been reported for BCSSTK15 on 64 processors(Intel Paragon) after improving the load balancing strategy. This indicates that the parallelism is limited in the sparse problems and our scheduled code has achieved reasonable performance. To the best of our knowledge, this is the highest performance obtained for sparse Cholesky factorization by automatically scheduled code.
We examine how much improvement is obtained by exploring commuting operations. Matrix P=2 P=4 P=8 P=16 P=32 BCSSTK15 4% 4% 10% 9% 15% BCSSTK16 6% 8% 4% 3% 7% BCSSTK17 8% 15% 4% -9% 3% BCSSTK24 6% 1% 4% 1% 4%
(a) (b) Figure 9 : (a) Performances of sparse Cholesky; (b) Execution time improvements after exploiting commutativity in sparse Cholesky factorization.
Sparse LU factorization without pivoting
LU factorization without pivoting can be used for positive de nite or strictly diagonal dominant matrices. Many of algebraic systems arising from the discretization of di usion and convection problems in uid dynamics 18] are nonsymmetric and diagonal dominant. The boundary conditions usually make the rst row of matrices strictly diagonal dominant. For such matrices, it can be shown 5] that stable solutions can be obtained by LU (Gaussian Elimination) without pivoting. It should be noted that parallel sparse LU with pivoting is more complicated 10].
We list the BLAS-3 level sparse LU factorization algorithm in Figure 10 (a). Again we use supernode partitioning to divide a sparse matrix and construct the sparse task graph from the partitioned matrix. It can be shown that a LU task graph is dependence-complete. We have tested two benchmark matrices from Boeing-Harwell collection: BCSSTK15 and BCSSTK24. Both matrices have symmetric structures. The purpose of using these matrices is to compare with the performance of Cholesky factorization and analyze the impact of algorithm characteristics on the code performance. We summarize some statistics of these two matrices and their sequential performance on a single Meiko CS-2 node in Table 3 . The sequential performance reaches 85% of the LINPACK performance.
We use the PYRROS mapping algorithm to generate schedules for LU graphs and use the run-time system discussed in this paper to execute the schedules. Speedups are shown in Figure 10(b) . On 32 processors, we have obtained 12:9 speedup for BCSSTK24, and 9:7 for BCSSTK15. Table 3 : Statistics of BCSSTK15 and BCSSTK24 for Sparse LU factorization.
Run-time overhead analysis
We classify the time spent at each processor into four parts: computation (CPU time used for executing tasks), sending overhead, receiving overhead, and waiting (or spinning) time which is the idle time that the processor spends on waiting for the arrival of the desired data item. If the desired data item arrives before a task issues the receiving primitive, the waiting time will be zero. This is an ideal situation in which communication is totally overlapped with computation. The waiting time is caused by both the network communication delay and the idle time in the pre-determined schedule. We calculate the ratio of sending, receiving and spinning time over the computation time at each processor. Figure 11 and Figure 12 show the overhead distributions in the LU and Cholesky factorizations. The Y axis lists the average ratio of overhead type X (X is sending, receiving or spinning) over the computation time.
It can be seen from both LU and Cholesky cases that when the number of processors increases, the receiving overhead is relatively small, which is about 2% of the computation time. The sending overhead is about 10% of the computation time. Thus the total run-time sending/receiving overhead contributes about 6 ? 10% of the total execution time at each processor. The reason that the receiving cost is much less than the sending cost is as follows. Messages are directly written to the user data space so that tasks can access them as soon as they arrive. However in the sending site, there is at least 9 microseconds cost for each message.
The major loss of the performance goes to the spinning time. The main reason is that a pre-determined schedule does not match actual run-time situation very well. This is because computation weights vary at run-time due to di erent caching behaviors. Also, the scheduler assumes that the sending communication 14 overhead is zero, which is not the case in reality. These two issues make timing among tasks di erent from what is expected at the static time and tend to increase the processor idle time. The performance becomes even more sensitive when task granularities are small 12]. We have examined the distribution of the sizes of supernodes (after blocking) for the tested matrices and foubd that there are a lot of blocks with sizes less than 6, especially for BCSSTK15, 78% of the blocks are of size 1, which makes the corresponding tasks just a few oating operations. In order to improve the performance further, our future work is to consider techniques of increasing average supernode sizes, e.g., supernode amalgamation 2, 19].
Sparse triangular solver and performance impact of granularities
From the discussion in the last section we know that the task granularity of an application is important to the performance of irregular code. In this section we further examine how the granularity of an application impacts both the performance and the overhead derived from the system itself. This study is important because we want to make sure that our run-time scheme performs stably when granularity varies. Table 4 : Statistics of BCSSTK15 and BCSSTK24 for the sparse triangular solver with one right-hand side.
We choose the sparse triangular solver as an example application. Generally a triangular solver is much less time consuming than factorization. A triangular system with multiple right-hand sides is useful in many scienti c applications 13]. An interesting property with the triangular solver is that it is very easy to adjust the granularity of tasks by using di erent number of right-hand sides. Again we rst use the supernode partitioning techniques to divide a triangular sparse matrix into N N 2-D blocks. The righthand sides are divided into N submatrices accordingly. We describe the block-oriented triangular solve algorithm in Figure 13 , where X k and B k are the unknown vectors and right-hand side vectors respectively. Notice that the tasks P ik 's for a given i are commutative. Similar to the sparse Cholesky case, we can also take advantage of the commutativity in triangular solvers. In order to examine the impact of task granularity, we choose instances of triangular solvers with the number of right-hand sides being 1, 10, 25 and 50 respectively. The sequential performances for these cases range from 3:1 to 6:1MFLOPS. We list the MFLOPS obtained for one right-hand side in Table 4 . The speedups are shown in Figure 14 (a). For those cases with the number of right-hand sides greater than 1, decent speedups are obtained. There is not much speedup for the case with 1 right-hand side. But its sequential time is very tiny, less than half second for the tested cases.
To examine the impact of granularity on the run-time overhead, we plot the sending overhead per message for these cases in Figure 14 (b) since the sending overhead dominates the run-time overhead. For each matrix, the curves from the bottom to the top are for P = 16; 8; 4; 2 respectively, where P is the number of processors used. It can be seen that the sending overhead per message varies very little for a given number of processors even the granularity changes signi cantly. Notice that in addition to the start-up cost for initiating a RMA request, the sending overhead also includes the costs for eliminating redundant sendings and indexing irregular data objects. Remember when a task tries to send a data item to another processor P x , it rst has to look up the remote address for the data item from a hash table. That is where the indexing cost comes from. Apparently, the number of data items distributed on each processor is counter proportional to the number of processors used, i.e., the hash table size on each processor is smaller if more processors are used. That is why the sending overhead per message on small number of processors is greater than that on a larger number of processors. However, when the number of processors is large 16 enough, this indexing cost will become less signi cant.
We explain the relation between the sending overhead per message with the granularity as follows. From Table 4 , we can see that the average computation cost per task for one right-hand side is 66:3 s for BCSSTK15 and 76:5 s for BCSSTK24. This average computation cost is relatively close to the sending overhead per message shown in Figure 14 (b). Notice that each task may have several outgoing messages and therefore the cost of computation is relatively smaller than the associated communication cost in those graphs. This is the reason that we do not see good performance for those cases. When increasing the number of right-hand sides to 10, the average computation task cost increases in an order of 10, the computation cost dominates and good speedups are obtained even with limited irregular parallelism. It shows that our system overhead has been kept low enough to gain good performance for sparse computation.
Overhead comparison with a bu ered communication scheme
We further compared our approach with a bu ered communication scheme to evaluate bene ts of the run-time techniques proposed in this paper. Overheadpyrros ). The dense submatrix has a size of 3 (we have similar improvement for other cases) and each task is relatively ne-grained. The matrix dimension is 192 and thus there are 64 blocks in each dimension of the matrix. The overhead for each processor is estimated as parallel time minus computation time. The large improvement ratios shown in Table 5 indicate that the bu ered communication scheme has a poor performance for supporting ne-grained computations.
Related work
We discuss several categories of related work as follows.
Application of graph scheduling for solving scienti c problems. As we discussed before, application of graph scheduling has been used in large n-body simulations 12]. In 25], it is found that pre-scheduling improves the performance of distributed sparse Cholesky factorization by 30% to 40% and there is still a lot of room for obtaining better absolute performance. Recently the work by 6] demonstrates that using both e ective DAG scheduling and low-overhead communication mechanisms, scalable performance can be obtained for solving ne-grained sparse triangular systems.
Run-time support for parallel computations. The Charm 24] system adopts a message driven approach for general asynchronous computation using dynamic scheduling. The Cilk 3] multi-threaded system aims at applications that have \strict" dependence structures. Randomized load balancing and work stealing techniques are used to execute a dynamic DAG. Compared to our work, Charm and Cilk are suitable for relatively coarser grain computation because dynamic load balancing on distributed memory machines has relatively high control overhead. There are also other run-time systems developed for irregular computations. We have mentioned Chaos in the introduction. Multipol 27] is a run-time library system which supports distributed data structures for several kinds of scienti c applications.
Consistency models. The issue of executing a program correctly has been studied in the context of distributed shared memory systems (DSM), e.g. TreadMarks 1] . The main di erence between our work and DSM is that a DSM system does not have any knowledge of a program to be executed. A DSM system maintains the consistency of shared objects for all processors without knowing if those objects will be further accessed or not. For example, two common protocols used in DSM are the write-invalidate and write-update. The rst one does not allow the prefetching of data, but in our model, we allow processors to send data items to tasks before those tasks need them. The second protocol supports prefetching, but the system may send data items to processors which do not need them. Hence DSM normally has higher overhead than our scheme and has a di culty for obtaining good performance for sparse matrix problems.
Low overhead communication mechanism. The lower-level communication primitives of our system could be implemented using active messages 26] which have been used in 6] for executing ne-grained triangular solving DAGs. We nd that it is not easy to integrate active messages with a general task graph execution scheme because careful network polling is required as demonstrated in 6]. In the situation of task parallelism with mixed granularities, it is not easy to decide where to insert the polling code 4]. 18 8 Conclusions
In this paper, we have presented an e cient run-time scheme for supporting task computations and shown its applications in irregular sparse Cholesky, LU factorization without pivoting and triangular solvers. The experiments indicate that overhead of run-time data communication is relatively small. The performance of sparse computation is limited by available parallelism, but not the overhead of the system. In 9, 10] we have examined the application of these techniques in sparse LU factorization with partial pivoting, which is a more challenging problem and provided an application programming interface (API) for users to parallelize irregular code using our run-time scheme.
