Abstract-This paper explores the problem of efficiently ordering interprocessor communication (IPC) operations in statically scheduled multiprocessors for iterative dataflow graphs. In most digital signal processing (DSP) applications, the throughput of the system is significantly affected by communication costs. By explicitly modeling these costs within an effective graph-theoretic analysis framework, we show that ordered transaction schedules can significantly outperform self-timed schedules even when synchronization costs are low. However, we also show that when communication latencies are nonnegligible, finding an optimal transaction order given a static schedule is an NP-complete problem, and that this intractability holds both under iterative and noniterative execution. We develop new heuristics for finding efficient transaction orders, and perform an extensive experimental comparison to gauge the performance of these heuristics.
become less expensive, it is increasingly common for embedded systems to incorporate multiple processor architectures. Multiple DSP cores can now be placed on a single chip. For example, the Texas Instruments TNETV3010 multiprocessor, targeted at voice-over-IP applications, integrates six high-performance DSP cores. It consumes 50 times less power than a general-purpose processor core, with over three times the transistor count on a die one-fifth the size. Another example is the Texas Instruments OMAP 59xx family of single-chip application processors.
This paper targets lower-cost, shared memory embedded architectures. IPC is assumed to take place through shared memory, which could be global memory between all processors, or could be distributed between pairs of processors (e.g., hardware first-in-first-out queues or dual ported memory). Such simple communication mechanisms, as opposed to crossbars and elaborate interconnection networks, are common in embedded systems, due to their simplicity and low cost.
A. Scheduling Dataflow Graphs
Our study of multiprocessor implementation strategies in this paper is in the context of homogeneous synchronous dataflow (HSDF) specifications. In HSDF, an application is represented as a directed graph in which vertices (actors) represent computational tasks of arbitrary complexity; edges (arcs) specify data dependencies; and the number of data values (tokens) produced and consumed by each actor is fixed. An actor executes or "fires" when it has enough tokens on its input arcs, and during execution, it produces tokens on its output arcs. HSDF imposes the restriction that on each invocation, each actor consumes exactly one token from each input arc, and produces one token on each output arc. HSDF and closely-related models are used extensively for multiprocessor implementation of embedded signal processing systems (e.g., see [1] , [5] - [7] ). We refer to an HSDF representation of an application as an application graph.
For multiprocessor implementation of dataflow graphs, actors in the graph need to be scheduled. Scheduling can be divided into three steps [8]: assigning actors to processors (processor assignment), ordering the actors assigned to each processor (actor ordering), and determining when each actor should commence execution. All of these tasks can be performed either at run-time or at compile time, which gives us different scheduling strategies. To reduce run-time overhead and improve predictability, it is often desirable in embedded applications to carry out as many of these steps as possible at compile time [8].
Typically, there is limited information available at compile time since the execution times of the actors are often estimated values. These may differ from the actual execution times because some actors display run-time variation in their execution times, for example due to conditionals or data-dependent loops within them. However, in a number of important embedded domains, such as DSP, it is widely accepted that execution time estimates are reasonably accurate, and that good compile-time decisions can be based on them. In this paper, we focus on scheduling methods that make extensive use of execution time estimates, and perform the first two steps (processor assignment and actor ordering) at compile time.
In relation to the scheduling taxonomy of Lee and Ha [8], there are three general strategies with which we are primarily concerned in this paper. In the fully static (FS) strategy, all three scheduling steps are carried out at compile time, including the determination of an exact firing time for each actor. In the self-timed (ST) strategy, on the other hand, processor assignment and actor ordering are performed at compile time, but run-time synchronization is used to determine actor firing times: an ST schedule executes by firing each actor invocation as soon as it can be determined via synchronization that the actor invocations on which it is dependent have all completed execution.
The FS and ST methods represent two extremes in the class of scheduling algorithms considered in this paper. The ST method is the least constrained scheme since the only constraints are the IPC dependencies, and it is tolerant of variations in execution times, while the FS strategy only works when tight worst case execution times are available, and forces system performance to conform to the available worst case bounds. When we ignore IPC costs, the ST schedule consequently gives us a lower bound on the average iteration period of the schedule since it executes in an ASAP (as soon as possible) manner.
The ordered transaction (OT) method [7] , [9] falls in-between these two strategies. It is similar to the ST method but also adds the constraint that a linear ordering of the communication actors is determined at compile time, and enforced at run-time. The linear ordering imposed is called the transaction order of the associated multiprocessor implementation.
The FS and OT strategies have significantly lower overall IPC cost since all of the sequencing decisions associated with communication are made at compile time. The ST method, on the other hand, requires more IPC cost since it requires synchronization checks to guarantee the fidelity of each communication operation, that is, to guarantee that buffer underflow and overflow are consistently avoided. Significant compile-time analysis can be performed to streamline this synchronization functionality [10], [11].
The metric of interest to us in this paper is the average iteration period. Intuitively, in an iterative execution of a dataflow graph, the iteration period is the number of cycles it takes for each of the actors in a schedule to execute exactly once, i.e., to complete a single graph iteration. Note that it is not necessary in a self-timed schedule for the iteration period to be the same from one graph iteration to the next, even when actor execution times are fixed [12]. The inverse of the average iteration period gives us the throughput, which is the average number of graph iterations carried out per unit time.
B. Terminology and Notation
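We denote the set of positive integers by $\mathbb{Z}^{+}$, the set of natural numbers $\{0, 1, 2, \ldots\}$ by $\mathbb{N}$, and the number of elements in a finite set $S$ by $|S|$. With each actor $v$ in an HSDF specification $G = (V, E)$, we associate an integer $t(v)$, which denotes the execution time estimate of $v$, and an integer $\mathrm{proc}(v)$, which denotes the processor to which $v$ is assigned in the assignment step. Each edge $e$ has a nonnegative integer delay associated with it, which is denoted by $\mathrm{delay}(e)$. These delays represent initial tokens, and specify dependencies between iterations of actors in iterative execution. For example, if the tokens produced by an actor $x$ on its $k$th invocation are consumed by actor $y$ on its $(k+2)$th invocation, the edge between $x$ and $y$ would have a delay of 2. Every edge $(x, y) \in E$ induces the precedence constraint

$$\mathrm{start}(y, k) \ge \mathrm{start}(x, k - \mathrm{delay}((x, y))) + t(x) \tag{1}$$

where $\mathrm{start}(v, k)$ denotes the starting time of the $k$th invocation of an actor $v$. Here, the right-hand side of (1) is set to 0 for $k - \mathrm{delay}((x, y)) < 0$ as an initial condition.
A path in a directed graph $(V, E)$ is a finite sequence $(e_1, e_2, \ldots, e_n)$, where each $e_i$ is in $E$, and $\mathrm{snk}(e_i) = \mathrm{src}(e_{i+1})$, for $i = 1, 2, \ldots, n-1$, with $\mathrm{src}(e)$ and $\mathrm{snk}(e)$ denoting the source and sink vertices of an edge $e$. We say that the path is directed from $\mathrm{src}(e_1)$ to $\mathrm{snk}(e_n)$. A path that is directed from some vertex to itself is called a cycle. Given a path $p = (e_1, e_2, \ldots, e_n)$, the path delay of $p$, denoted $\mathrm{Delay}(p)$, is given by

$$\mathrm{Delay}(p) = \sum_{i=1}^{n} \mathrm{delay}(e_i) \tag{2}$$

Each cycle $C$ in a dataflow graph must satisfy $\mathrm{Delay}(C) > 0$ to avoid deadlock.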
The evolution of a self-timed implementation can be modeled by Sriram's IPC graph model [12]. Given an application graph and an associated self-timed schedule, the IPC graph, denoted $G_{ipc}$, is constructed by instantiating a vertex for each application graph actor, connecting an edge from each actor to the actor that succeeds it on the same processor, and adding an edge that has unit delay from the last actor on each processor to the first actor on the same processor. Also, for each application graph edge $(x, y)$ that connects actors that execute on different processors, an interprocessor edge is instantiated in $G_{ipc}$ from $x$ to $y$. A sample application graph and a self-timed schedule are illustrated in Fig. 1, and the corresponding IPC graph is illustrated in Fig. 2.
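The construction just described can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `build_ipc_graph`, the edge representation as `(src, dst, delay)` triples, and the choice to carry the application-graph delay over to each interprocessor edge are all assumptions made for the example.

```python
def build_ipc_graph(processors, app_edges):
    """Return IPC-graph edges as (src, dst, delay) triples.

    processors: per-processor actor orderings, e.g. [["A", "B"], ["C"]]
    app_edges:  application-graph edges as (src, dst, delay) triples
    """
    proc_of = {a: p for p, actors in enumerate(processors) for a in actors}
    edges = []
    for actors in processors:
        # Zero-delay edge from each actor to its successor on the same processor.
        edges += [(u, v, 0) for u, v in zip(actors, actors[1:])]
        # Unit-delay edge from the last actor back to the first (next iteration).
        edges.append((actors[-1], actors[0], 1))
    for u, v, d in app_edges:
        # Only edges that cross processors become interprocessor edges.
        if proc_of[u] != proc_of[v]:
            edges.append((u, v, d))  # assumption: the application delay is kept
    return edges


# Hypothetical two-processor schedule: A then B on one processor, C on the other.
ipc = build_ipc_graph([["A", "B"], ["C"]],
                      [("A", "B", 0), ("A", "C", 0), ("C", "B", 1)])
print(ipc)
```

The edge `("A", "B", 0)` of the application graph does not appear as an interprocessor edge because both actors share a processor; their ordering is already captured by the intra-processor edge.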
IPC costs (estimated transmission latencies through the multiprocessor network) can be incorporated into the IPC graph model by explicitly including communication (send and receive) actors, and setting the execution times of these actors to equal the associated IPC costs.
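The IPC graph is an instance of Reiter's computation graph model [13], also known as the timed marked graph model in Petri net theory [14], and from the theory of such graphs, it is well known that in the ideal case of unlimited bus bandwidth, the average iteration period for the ASAP execution of an IPC graph is given by the maximum cycle mean (MCM) of $G_{ipc}$, which is defined by

$$\mathrm{MCM}(G_{ipc}) = \max_{\text{cycle } C \text{ in } G_{ipc}} \left\{ \frac{\sum_{v \in C} t(v)}{\mathrm{Delay}(C)} \right\} \tag{3}$$

where $t(v)$ is the execution time of actor $v$ and $\mathrm{Delay}(C)$ is the sum of the edge delays on $C$.

Fig. 1. Example of an application graph and an associated self-timed schedule.

The quotient in (3) is referred to as the cycle mean of the associated cycle $C$. A similar data structure that is useful in analyzing OT implementations is Sriram's ordered transaction graph model [12]. Given an ordering $O = (o_1, o_2, \ldots, o_p)$ for the communication actors in an IPC graph $G_{ipc} = (V, E_{ipc})$, the corresponding ordered transaction graph is defined as the directed graph $G_{OT} = (V, E_{OT})$, where

$$E_{OT} = E_{ipc} \cup \{(o_1, o_2), (o_2, o_3), \ldots, (o_{p-1}, o_p), (o_p, o_1)\} \tag{4}$$

with $\mathrm{delay}((o_i, o_{i+1})) = 0$ for $1 \le i < p$, and $\mathrm{delay}((o_p, o_1)) = 1$. Thus, an IPC graph can be modified by adding edges obtained from the ordering $O$ to create the ordered transaction graph.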
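To make the cycle-mean and ordered-transaction definitions concrete, the following sketch computes the MCM of a small graph by enumerating its simple cycles, and then adds the edges of a transaction order. It is an illustration under stated assumptions: the helper names (`simple_cycles`, `mcm`, `add_transaction_order`), the toy graph, and the brute-force cycle enumeration are all hypothetical choices that are practical only for small graphs; polynomial-time MCM algorithms exist and would be used in practice.

```python
import math
from fractions import Fraction


def simple_cycles(edges):
    """Yield every simple cycle once, as a list of (src, dst, delay) edges."""
    out = {}
    for e in edges:
        out.setdefault(e[0], []).append(e)

    def dfs(root, node, path, seen):
        for e in out.get(node, []):
            nxt = e[1]
            if nxt == root:
                yield path + [e]
            elif nxt > root and nxt not in seen:  # root is the smallest node on the cycle
                yield from dfs(root, nxt, path + [e], seen | {nxt})

    for root in sorted(out):
        yield from dfs(root, root, [], {root})


def mcm(t, edges):
    """Maximum cycle mean; math.inf signals a zero-delay cycle (deadlock)."""
    best = Fraction(0)
    for cycle in simple_cycles(edges):
        delay = sum(d for _, _, d in cycle)
        if delay == 0:
            return math.inf
        best = max(best, Fraction(sum(t[u] for u, _, _ in cycle), delay))
    return best


def add_transaction_order(edges, order):
    """Append the OT edges of a linear order: zero delay, except the closing edge."""
    chain = [(a, b, 0) for a, b in zip(order, order[1:])]
    return edges + chain + [(order[-1], order[0], 1)]


t = {"A": 3, "s": 1, "r": 1, "B": 2}
ipc = [("A", "s", 0), ("s", "A", 1),   # processor 1: A, then send actor s
       ("r", "B", 0), ("B", "r", 1),   # processor 2: receive actor r, then B
       ("s", "r", 0)]                  # interprocessor edge
print(mcm(t, ipc))                                      # 4
print(mcm(t, add_transaction_order(ipc, ["s", "r"])))   # 4
print(mcm(t, add_transaction_order(ipc, ["r", "s"])))   # inf
```

The order `["r", "s"]` contradicts the zero-delay dependency from `s` to `r`, so it closes a zero-delay cycle; the sketch reports this as an infinite MCM rather than raising an error.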
II. PREVIOUS WORK
Sriram and Lee [9] , [12] discuss some of the advantages and disadvantages of the OT strategy compared to the ST strategy, namely lower synchronization and arbitration costs for the IPC mechanism at the expense of some run-time flexibility and the small additional hardware cost of a simple transaction controller. They also develop a method to compute an optimum transaction order when a fully-static schedule is given beforehand. In this approach, a set of inequalities is constructed using the timing information of the given FS schedule and represented as a graph. The Bellman-Ford shortest path algorithm is applied to this graph to obtain new starting times of the actors, thereby modifying the original FS schedule. A transaction order is then obtained by sorting the starting times of the communication actors. We shall term this method of finding the transaction orders, which is an efficient polynomial-time algorithm, the Bellman-Ford-Based (BFB) method. Under an assumption that the cost (latency) of IPC is zero, Sriram shows that the transaction order determined by the BFB technique is always optimal.
More specifically, the developments in [9] show that optimal transaction orders can be derived in polynomial time if IPC costs are negligible; however, the performance of the self-timed schedule is an upper bound on the performance of corresponding ordered transaction schedules under negligible IPC costs. Conversely, we show in this paper that when IPC costs are not negligible, as is frequently and increasingly the case in practice, the problem of determining an optimal transaction order is NP-hard, but at the same time, the performance of a self-timed schedule can be exceeded significantly by a carefully constructed transaction order. Thus, constructing optimal transaction orders is harder under nonnegligible IPC costs, but the potential benefit of employing efficient transaction orders is greater. Furthermore, under nonzero IPC costs, we must resort to heuristics for efficient solutions, the polynomial-time BFB algorithm is no longer optimal, and alternative techniques that account for IPC costs are preferable.
Numerous approaches have been proposed for incorporating IPC costs into the assignment and ordering steps of scheduling (e.g., [4] , [15] ). The techniques that we propose in this paper are complementary to these approaches in that they provide a means for mapping the resulting schedules into efficient OT implementations, which eliminate the performance and power consumption overhead associated with run-time synchronization and contention resolution.
For multiprocessor networks utilizing point-to-point links, Surma and Sha [16] , [17] have investigated static scheduling of messages using a collision graph model, and have shown that embedding this model within existing scheduling algorithms can lead to significant improvement.
III. COMPARISON OF SELF-TIMED AND ORDERED TRANSACTION STRATEGIES
Given an application graph, an associated multiprocessor schedule, and an FS implementation, an OT implementation, and an ST implementation for the schedule, suppose $T_{FS}$, $T_{OT}$, and $T_{ST}$, respectively, denote the average iteration periods of the corresponding schedules. In general, when IPC costs are negligible, $T_{ST} \le T_{OT} \le T_{FS}$ [12]. This is because the ST method has the fewest constraints. The ST schedule only has assignment and ordering constraints, while the OT schedule has transaction ordering constraints in addition to those constraints, and the FS schedule has exact timing constraints that subsume the constraints in the ST and OT schedules. ST schedules overlap in a natural manner, and eventually settle into a periodic pattern of iterations. This pattern can be exponential in size, and therefore, the ST schedule has the advantage that in successive iterations, the transaction order may be different, while this flexibility is not available for the OT and FS schedules.
In practical cases, however, the IPC cost is nonzero. Depending on the bandwidth of the bus, IPC costs may be quite significant. The throughput of the ST schedule can be computed easily when IPC costs are ignored by calculating the MCM of the corresponding dataflow graph (i.e., via (3)). However, when IPC costs are taken into account, this can no longer be done since the notion of bus contention comes into the picture. Not only do the communication actors in the dataflow graph have to wait for sufficient tokens on the input arcs to fire, they also have to wait for the bus to be available, i.e., no other communication actor should be accessing the bus at the same instant of time. Therefore, the throughput of the self-timed schedule is typically derived using simulation techniques, which are time-consuming. On the other hand, the throughput of the OT schedule can still be obtained by calculating the MCM of the ordered transaction graph since there will be no bus contention when a linear order is imposed on the communication actors [9].
The ordering $T_{ST} \le T_{OT} \le T_{FS}$ of the average iteration periods of the ST, OT, and FS schedules is also no longer valid in the presence of nonzero IPC costs. To see why this is true, assume that two communication actors become enabled (have sufficient input tokens to fire) at more or less the same time. Then the ST method will schedule the communication actor that becomes enabled earlier. Doing this may result in a lower throughput since, for example, the processor that contains the communication actor that is scheduled later might be more heavily loaded. The FS and the OT methods avoid such pitfalls by analyzing the schedules at compile time, and producing an exact firing time assignment, or a transaction order that takes the entire schedule into consideration. Intuitively, the ST method follows a more greedy, ASAP approach in choosing which communication actor to schedule next, and this can result in inefficient execution patterns.
Example 1: To illustrate how an ST schedule might perform worse than an OT schedule, consider the IPC graph of Fig. 2 . Dashed edges represent inter-processor data dependencies. Numbers beside actors show their execution times, numbers beside edges indicate nonzero delays, denotes the th send actor of computation actor , and denotes the th receive actor of . Fig. 3 shows the periodic pattern into which the ST schedule eventually settles. Although Processor 1 is most heavily loaded, we see that there are instances when the processor is idling waiting for the bus to become free. In contrast, when the transaction order is enforced (Fig. 4) , an 11% lower average iteration period results. This is because the transaction order is computed in a fashion that enables the heavily loaded Processor 1 to access the bus whenever required. Such an ability to prioritize strategically-selected transactions is especially important in heterogeneous multiprocessors, which often have unbalanced loads due to variations in processing capabilities of the computing resources.
The ST approach has the further disadvantage that in the presence of execution time uncertainties, there is no known method for computing a tight worst-case iteration period, even using simulation techniques. In particular, the period of the ST schedule obtained by using worst case execution time estimates of the actors does not necessarily give us the worst case iteration period of a schedule. This can prove to be a big disadvantage in real-time systems where worst-case bounds are needed beforehand.
Example 2: Consider the IPC graph of Fig. 5 , and suppose that Actor 1 has a worst-case execution time of 21, and a best case execution time of 19. Fig. 6 shows the ST schedule that results when Actor 1 has an execution time of 21. An iteration period of 50 is obtained. However, when the same schedule is simulated for an execution time of 19, we obtain an iteration period of 59 as shown in Fig. 7 .
In contrast, the iteration period obtained by computing the MCM of the ordered transaction graph with worst-case actor execution times is the worst-case iteration period. This is because the MCM is an accurate measure of performance for ordered transaction implementations [9] , [12] and the MCM can only increase or remain the same when the execution time of an actor is increased.
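The monotonicity used in this argument follows directly from the cycle-mean definition. As a short worked step, write $t(v)$ for the execution time of actor $v$ and $\mathrm{Delay}(C) > 0$ for the total delay of a cycle $C$ (notation assumed here for illustration), and let $t'$ be a second execution-time assignment with $t'(v) \ge t(v)$ for every actor:

```latex
\frac{\sum_{v \in C} t'(v)}{\mathrm{Delay}(C)}
  \;\ge\; \frac{\sum_{v \in C} t(v)}{\mathrm{Delay}(C)}
  \quad \text{for every cycle } C
\qquad\Longrightarrow\qquad
\max_{C} \frac{\sum_{v \in C} t'(v)}{\mathrm{Delay}(C)}
  \;\ge\; \max_{C} \frac{\sum_{v \in C} t(v)}{\mathrm{Delay}(C)} .
```

Since the set of cycles and their delays do not depend on the execution times, evaluating the MCM at the worst-case execution times bounds the MCM for every admissible execution-time assignment.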
IV. FINDING OPTIMAL TRANSACTION ORDERS
In the transaction ordering problem, our objective is to determine a transaction order for a given IPC graph such that the MCM of the resulting ordered transaction graph is minimized (so that throughput is maximized). As mentioned in Section II, it has been shown that this problem is tractable when IPC costs are ignored. In this section, we show that when IPC costs are considered, the transaction ordering problem becomes NP-complete.
We show this by first showing that determining an optimal transaction order for noniterative implementations, which is a more restricted (easier) problem, is NP-complete. To convert an iterative IPC graph to a noniterative one, it suffices to remove all edges in the graph that have delays of one or more. This results in an acyclic graph since any cycle in the original graph must have a delay of one or more for the graph not to be deadlocked. We refer to the resulting graph as the noniterative IPC (NIPC) graph, and to the graph obtained by adding the delayless transaction order edges of (4) to an NIPC graph as the noniterative ordered transaction (NOT) graph.
By definition, the total execution time (makespan) of a NOT graph is finite, and this execution time can be determined in polynomial time as the length of the longest cumulative-execution-time path in the graph, since the graph is acyclic and the execution times of all actors are nonnegative. However, given an IPC graph, finding a transaction order that minimizes the makespan of the associated NOT graph is intractable.
Definition 3: The noniterative transaction ordering problem is defined as follows. Given an NIPC graph $G$ and a positive integer $k$, does there exist a transaction order $O$ such that the NOT graph obtained from $G$ and $O$ has a makespan that is less than or equal to $k$?
To show that noniterative transaction ordering is NP-hard, we derive a reduction from the sequencing with release times and deadlines (SRTD) problem, which is known to be NP-complete [18]. The SRTD problem is defined as follows.
Definition 4: (The SRTD problem). Given an instance set $T$ of tasks, and for each task $t \in T$, a length (duration) $l(t) \in \mathbb{Z}^{+}$, a release time $r(t) \in \mathbb{N}$, and a deadline $d(t) \in \mathbb{Z}^{+}$, is there a single-processor schedule for $T$ that satisfies the release time constraints and meets all the deadlines? That is, is there a one-to-one function (called a valid SRTD schedule) $\sigma : T \to \mathbb{N}$, with $\sigma(t) > \sigma(t')$ implying $\sigma(t) \ge \sigma(t') + l(t')$, and for all $t \in T$, $\sigma(t) \ge r(t)$ and $\sigma(t) + l(t) \le d(t)$?

Theorem 1: The noniterative transaction ordering problem is NP-complete.
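The polynomial-time makespan computation mentioned above is a longest-path calculation over a topological order. The sketch below is a minimal illustration; the function name `makespan`, the dictionary-based graph representation, and the sample graph are assumptions made for the example, and it relies on `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def makespan(t, edges):
    """Longest cumulative-execution-time path in an acyclic graph.

    t:     execution time of each actor (nonnegative)
    edges: zero-delay precedence edges as (src, dst) pairs
    """
    preds = {v: set() for v in t}
    for u, v in edges:
        preds[v].add(u)
    finish = {}
    # static_order() lists every actor after all of its predecessors.
    for v in TopologicalSorter(preds).static_order():
        start = max((finish[u] for u in preds[v]), default=0)
        finish[v] = start + t[v]
    return max(finish.values())


# Hypothetical four-actor graph: a -> c -> b, and x -> b.
print(makespan({"a": 2, "c": 3, "b": 1, "x": 4},
               [("a", "c"), ("c", "b"), ("x", "b")]))  # 6
```

`TopologicalSorter` raises `CycleError` on a cyclic input, which matches the requirement that the graph be acyclic.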
Proof: This problem is clearly in NP since we can verify in polynomial time whether the longest path length (in terms of cumulative execution time) of the graph is less than or equal to a given positive integer. Now suppose that we are given an instance of the SRTD problem with task set $T = \{t_1, t_2, \ldots, t_n\}$. We construct an NIPC graph from this instance by carrying out the following steps. Here, all edges instantiated are delay-less unless otherwise specified. Let $k$ be an integer that is at least the maximum deadline of the tasks in the given instance of the SRTD problem. For each task $t_i$, instantiate on a separate processor a computation actor with execution time $r(t_i)$, a communication actor with execution time $l(t_i)$, and a computation actor with execution time $k - d(t_i)$; then instantiate an edge from the first computation actor to the communication actor, and another edge from the communication actor to the second computation actor.
Each send actor is connected to the receive actor by an interprocessor edge with a delay of unity. Since each of the interprocessor edges has a delay of unity, these edges are not present in the NIPC graph. Without loss of generality, we assume that there are an even number of tasks, so that the number of send and receive actors is the same (if the number of tasks is not even to begin with, we can instantiate an appropriately-defined dummy actor to generate an equivalent "even-task" instance). Observe from our construction that from the tasks in the given instance of the SRTD problem, we construct a graph that involves one processor, one communication actor, and two computation actors per task, with a number of edges that is linear in the number of tasks.
Claim: If there exists a transaction order for the constructed graph that gives a makespan that is less than or equal to $k$, then there exists a valid SRTD schedule for the given instance of the SRTD problem.
The reasoning behind our construction and the above claim is that we make the communication actors of the ordered transaction correspond exactly to the tasks of the SRTD problem. We do this by making the execution time of the computation actor before each communication actor equal to the release time of the associated task, which guarantees that the communication actors cannot begin execution before their respective release times. Also, since these computation actors begin execution at time 0 (each is on a different processor), the release times correspond to the times at which they complete execution. Similarly, the execution time of the computation actor that follows the communication actor for a task $t$ is chosen to be $k - d(t)$, where $d(t)$ is the deadline of $t$ and $k$ is the makespan bound, so that the communication actor must complete its execution by $d(t)$ for the makespan to be less than or equal to $k$. This is true because the computation actor can begin execution immediately after the communication actor has finished. Therefore, a valid SRTD schedule corresponds exactly to the shared bus schedule in the derived instance of the noniterative transaction ordering problem. If we can find a transaction order that has a makespan less than or equal to $k$, we have a bus schedule that schedules the communication actors in the same manner as an appropriate single-processor schedule for the corresponding SRTD tasks. Conversely, if a transaction order cannot be found that satisfies the given makespan constraint, it is easily seen that there is no valid SRTD schedule for the given instance of the SRTD problem.
Note that in Theorem 1, we have simplified the problem greatly by assuming the interprocessor edges to have unit delays. This removes the interdependencies that are imposed by these edges, but even with this simplification, the problem remains NP-complete.
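The correspondence used in the proof can be exercised on small instances with the sketch below. It is an illustration only: the names `reduce_srtd` and `find_order`, the actor labels `pre`, `com`, and `post`, and the sample task sets are hypothetical, the bound $k$ is taken to be the maximum deadline, and the search over transaction orders is a brute-force enumeration (exponential in the number of tasks, as one would expect for an NP-complete problem).

```python
from graphlib import TopologicalSorter
from itertools import permutations


def makespan(t, edges):
    """Longest cumulative-execution-time path in an acyclic graph."""
    preds = {v: set() for v in t}
    for u, v in edges:
        preds[v].add(u)
    finish = {}
    for v in TopologicalSorter(preds).static_order():
        finish[v] = max((finish[u] for u in preds[v]), default=0) + t[v]
    return max(finish.values())


def reduce_srtd(tasks):
    """tasks: list of (release, length, deadline). Returns (t, edges, k)."""
    k = max(d for _, _, d in tasks)
    t, edges = {}, []
    for i, (r, l, d) in enumerate(tasks):
        pre, com, post = ("pre", i), ("com", i), ("post", i)
        # One processor per task: release time, task length, slack after deadline.
        t[pre], t[com], t[post] = r, l, k - d
        edges += [(pre, com), (com, post)]
    return t, edges, k


def find_order(tasks):
    """Return a transaction order with makespan <= k, or None if none exists."""
    t, edges, k = reduce_srtd(tasks)
    for order in permutations(range(len(tasks))):
        ot = [(("com", a), ("com", b)) for a, b in zip(order, order[1:])]
        if makespan(t, edges + ot) <= k:
            return order
    return None


print(find_order([(0, 3, 3), (2, 2, 6), (0, 1, 7)]))  # (0, 1, 2)
print(find_order([(0, 3, 3), (0, 3, 3)]))              # None
```

In the first instance the tasks can occupy the bus during [0, 3], [3, 5], and [5, 6]; in the second, both tasks need the interval [0, 3], so no order meets the bound.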
Example 3: Suppose that we are given an instance of the SRTD problem with task set ; and respective release times , lengths , and deadlines . To construct an instance of the noniterative transaction ordering problem with , we create 4 processors each with 3 vertices. The execution times are determined from above-e.g.,
. The resulting NOT graph is illustrated in Fig. 8 . Dash-dot edges indicate OT edges. Removing the dash-dot edges that represent the transaction order edges gives us the NIPC graph constructed from above. This figure shows a transaction order where the schedule length of 11 is satisfied. This means that there exists a valid SRTD schedule for the given SRTD problem instance. The start times of the tasks can be obtained by finding the longest path lengths between the source nodes and the corresponding communication actors. Setting the starting times of the tasks to equal , respectively, we obtain a valid SRTD schedule for the SRTD problem instance.
As demonstrated by Theorem 2 below, we can extend the Proof of Theorem 1 to show that the transaction ordering problem is NP-complete in the iterative context as well as the noniterative case.
Definition 5: The iterative transaction ordering problem (also called the transaction ordering problem) is defined as follows. Given an IPC graph $G_{ipc}$ and a positive integer $k$, does there exist a transaction order $O$ such that the ordered transaction graph obtained from $G_{ipc}$ and $O$ has an MCM that is less than or equal to $k$?
Theorem 2: The iterative transaction ordering problem is NP-complete.
Proof Sketch: The MCM can be found in polynomial time; therefore, the problem is in NP.
To establish NP-hardness, we again derive a reduction from the SRTD problem, and we modify the graph construction from the Proof of Theorem 1 so that the MCM equals the makespan. Details are omitted due to page limitations.
V. THE TRANSACTION PARTIAL ORDER HEURISTIC
The BFB technique does not take bus contention into consideration while scheduling the transaction order. Instead, it tries to find a transaction order that will be close to or equal to that of the associated self-timed schedule. However, we have demonstrated that in the presence of nonzero IPC costs, the OT method can, in fact, perform significantly better than the ST method, and thus, more direct consideration of OT execution is clearly worthwhile when scheduling transactions. For this purpose, we propose in this section a heuristic, called the transaction partial order (TPO) algorithm, that simultaneously takes IPC costs and the serialization effects of transaction ordering into account when determining the transaction order. Note that OT edges added to the IPC graph can only increase the MCM of the IPC graph, or leave the MCM unchanged. The MCM of the original IPC graph therefore represents a lower bound on the achievable average iteration period. By adding OT edges, we are effectively removing bus contention by making sure that no two communication actors submit conflicting bus requests, and this generally increases the MCM of the IPC graph. The TPO heuristic finds a transaction order on the basis that an OT edge that increases the MCM of the IPC graph by a comparatively smaller amount should be given preference. Therefore, to determine which communication actor should be scheduled first, we insert OT edges between communication actors that are contending for the bus (during the transaction ordering process), and calculate the corresponding MCM of the IPC graph. Actors whose corresponding MCM's are more favorable under such an evaluation are scheduled earlier in the transaction order.
We note that in a correct transaction order, the OT edges should not introduce any zero delay cycles into the IPC graph. Such cycles would create deadlock in the system. Recall that if we remove the nonzero delay edges from the IPC graph $G_{ipc}$, the resulting graph is acyclic. We define a graph $G'$, which is constructed by adding the OT edges to this acyclic graph. In a correct transaction order, $G'$ is also acyclic, and so we can perform a topological sort of $G'$. In any correct transaction ordering, there must exist some topological sort of $G'$ such that for all OT edges $e$, the source vertex of $e$ precedes the target vertex of $e$ in the topological sort. We therefore see that the solution space of possible transaction orderings is a subset of the possible topological sorts of the acyclic graph. Unfortunately, there can be an exponential number of possible topological sorts for a graph. For example, a complete bipartite graph with $2n$ nodes ($n$ on each side, with all edges directed from one side to the other) has $(n!)^2$ possible topological sorts.
Instead of operating directly on this graph, we derive a transaction partial order (TPO) graph $G_{TPO}$. The transaction partial order graph incorporates the minimum set of dependencies imposed among different processors by the communication actors of the IPC graph. These dependencies must be obeyed by any ordering of the communication operations. Fig. 9 gives pseudocode for generating $G_{TPO}$. The algorithm attempts to remove as many noncommunication nodes as possible while maintaining dependencies between communication actors. Since we allow for IPC graphs with overlapped communication and computation, some of the computation nodes have both multiple predecessors and multiple successors. These cannot be removed since this would require imposing some additional dependencies on these nodes, and we want to represent the minimal set of dependencies. The algorithm proceeds in multiple passes, and terminates when no more computation actors can be removed.
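The growth in the number of topological sorts can be checked by brute force on small cases. The sketch below assumes a complete bipartite graph with $n$ nodes on each side and every edge directed from the left side to the right side, for which the count is $(n!)^2$; the function names are hypothetical, and the enumeration over all permutations is only feasible for very small $n$.

```python
from itertools import permutations
from math import factorial


def count_topological_sorts(nodes, edges):
    """Count the linear orders of nodes that respect every (src, dst) edge."""
    count = 0
    for perm in permutations(nodes):
        pos = {v: i for i, v in enumerate(perm)}
        if all(pos[u] < pos[v] for u, v in edges):
            count += 1
    return count


def complete_bipartite(n):
    """n left nodes, n right nodes, every edge directed left -> right."""
    left = [("L", i) for i in range(n)]
    right = [("R", i) for i in range(n)]
    return left + right, [(u, v) for u in left for v in right]


for n in (1, 2, 3):
    nodes, edges = complete_bipartite(n)
    print(n, count_topological_sorts(nodes, edges), factorial(n) ** 2)
```

Every left node must precede every right node, so a valid sort is any ordering of the left side followed by any ordering of the right side, which gives the $n! \cdot n!$ count.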
Example 4: Fig. 10 shows a TPO graph derived from an IPC graph using the algorithm given in Fig. 9 . Two passes were required to reduce the graph. Notice that all the dependencies imposed by the IPC graph are retained in the TPO graph.
The TPO heuristic proceeds by considering, one by one, each vertex of the TPO graph $G_{TPO}$ that has no input edges (vertices in the TPO graph that have no input edges are called ready vertices) as a candidate to be scheduled next in the transaction order. Interprocessor edges are drawn from each candidate vertex to all other ready vertices in the IPC graph $G_{ipc}$, and the corresponding MCM is measured. The candidate whose corresponding MCM is the least when evaluated in this fashion is chosen as the next vertex in the ordered transaction, and deleted from $G_{TPO}$. This process is repeated until all communication actors have been scheduled into a linear ordering. A pseudocode specification of the TPO heuristic is given in Fig. 11.

Fig. 10. The TPO graph derived from the IPC graph in (a) after one pass (b) and two passes (c) of the algorithm given in Fig. 9. Note the labeling of the IPC tasks (shaded): s(i, j) represents a send from computational task i to computational task j, while r(i, j) represents a receive from computational task i to computational task j.

The algorithm makes sense intuitively since the dependencies imposed by the edges drawn from the candidate vertices will remain when the transaction ordering is enforced. These edges represent constraints in addition to the interprocessor edges that are already present in the IPC graph $G_{ipc}$ and, thus, they can only increase the MCM or leave the MCM unchanged. Since we are interested in minimizing the MCM, we choose candidate vertices that increase the MCM by the least possible amounts. Thus, the algorithm follows a greedy strategy in choosing vertices, but it explicitly takes communication serialization and IPC costs into account.
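The greedy selection described above can be sketched as follows. This is not a transcription of the pseudocode in Fig. 11, which is not reproduced here; it is a simplified illustration under several assumptions. The TPO graph is stood in for by a hypothetical `preds` dictionary that lists, for each communication actor, the communication actors that must precede it; the MCM is computed by brute-force cycle enumeration; ties are broken by actor name; and the two-processor example graph is invented for the purpose.

```python
import math
from fractions import Fraction


def simple_cycles(edges):
    """Yield every simple cycle once, as a list of (src, dst, delay) edges."""
    out = {}
    for e in edges:
        out.setdefault(e[0], []).append(e)

    def dfs(root, node, path, seen):
        for e in out.get(node, []):
            nxt = e[1]
            if nxt == root:
                yield path + [e]
            elif nxt > root and nxt not in seen:
                yield from dfs(root, nxt, path + [e], seen | {nxt})

    for root in sorted(out):
        yield from dfs(root, root, [], {root})


def mcm(t, edges):
    """Maximum cycle mean; math.inf signals a zero-delay cycle (deadlock)."""
    best = Fraction(0)
    for cycle in simple_cycles(edges):
        delay = sum(d for _, _, d in cycle)
        if delay == 0:
            return math.inf
        best = max(best, Fraction(sum(t[u] for u, _, _ in cycle), delay))
    return best


def tpo_order(t, ipc_edges, preds):
    """Greedy transaction ordering; preds must describe a valid partial order."""
    order, fixed = [], list(ipc_edges)
    remaining = set(preds)
    while remaining:
        done = set(order)
        ready = [c for c in sorted(remaining) if preds[c] <= done]
        best, best_mcm = None, None
        for cand in ready:
            # Trial edges: candidate before every other ready vertex.
            trial = fixed + [(cand, other, 0) for other in ready if other != cand]
            if order:
                trial.append((order[-1], cand, 0))
            value = mcm(t, trial)
            if best is None or value < best_mcm:
                best, best_mcm = cand, value
        if order:
            fixed.append((order[-1], best, 0))  # keep the OT edge chosen so far
        order.append(best)
        remaining.remove(best)
    return order


t = {"A": 3, "B": 1, "s1": 1, "s2": 1, "r1": 1, "r2": 1}
ipc = [("A", "s1", 0), ("s1", "r2", 0), ("r2", "A", 1),   # processor 1
       ("B", "s2", 0), ("s2", "r1", 0), ("r1", "B", 1),   # processor 2
       ("s1", "r1", 0), ("s2", "r2", 0)]                  # interprocessor edges
preds = {"s1": set(), "s2": set(), "r1": {"s1", "s2"}, "r2": {"s1", "s2"}}

order = tpo_order(t, ipc, preds)
chain = [(a, b, 0) for a, b in zip(order, order[1:])]
final = ipc + chain + [(order[-1], order[0], 1)]
print(order, mcm(t, ipc), mcm(t, final))
```

On this example the heuristic schedules `s2` before `s1`, because forcing `s1` first would lengthen the critical cycle through the heavily loaded processor; the resulting ordered transaction graph has the same MCM as the IPC graph, which is the lower bound.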
When we apply the TPO heuristic to the IPC graph of Fig. 2 , the schedule we obtain is illustrated by the Gantt chart of Fig. 4 .
The OT edges corresponding to the actors that have already been scheduled are added as the heuristic proceeds since they represent the schedule of the bus, and hence, make the heuristic more accurate for the later stages of the transaction order. The maximum number of nodes in the ready list at any given instant is (where is the number of processors). The complexity of the algorithm is thus since the complexity of computing the MCM of a graph is . The edge of the transaction order that connects the last communication actor in the ordering with the first one has a delay of unity (to represent the transition to the next graph iteration). We can improve the performance of the TPO algorithm by introducing this edge at the beginning because it will give a more accurate estimate of the MCM in choosing vertices later as the heuristic proceeds. Under this modification, the heuristic proceeds as before, except that the "last" (unit-delay) transaction ordering edge is drawn at the beginning. Since has a maximum of communication actors that can be scheduled last in the transaction order, the modified heuristic is repeated for each of these candidate communication actors that can be scheduled in the end, and the best solution that results is selected. This increases the complexity of the algorithm by a factor of to .
VI. BRANCH AND BOUND STRATEGY
Since the transaction ordering problem is intractable, we are unable to efficiently find optimal transaction orders on a consistent basis. We have implemented a branch and bound (BB) strategy to explore the search space comprehensively. The branch and bound approach gives us a lower bound on the iteration period that can be achieved and provides us with a useful measure of comparison.
The branch and bound strategy initially computes the list of actors that are ready from . A candidate vertex is chosen and deleted from and the ready list is updated. In successive iterations, an edge is drawn in from the previous candidate actor to the current candidate and the MCM is computed. If the computed MCM is less than the lowest MCM found so far for a complete transaction order, the process is repeated recursively; otherwise, the candidate is discarded and the next one is chosen. The branch and bound approach has the advantage of being able to prune the search space effectively, but since the MCM has to be computed after each edge is added rather than after all the edges have been added, it is unable to find optimum transaction orders for complex graphs.
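The recursion and pruning rule can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation evaluated in this paper. The function `branch_and_bound` is a hypothetical name, and `partial_cost` stands in for the MCM computation; it must never decrease as the prefix grows, which mirrors the fact that added ordering edges can only increase the MCM or leave it unchanged. The toy cost used in the example is made up purely to exercise the search.

```python
def branch_and_bound(actors, preds, partial_cost):
    """actors: list of communication actors.
    preds: {actor: set of actors that must precede it} (the partial order).
    partial_cost(prefix): cost of a partial order; must be nondecreasing
    as the prefix is extended (a stand-in for the MCM)."""
    best = {"cost": float("inf"), "order": None}

    def extend(prefix, placed):
        bound = partial_cost(prefix)
        if bound >= best["cost"]:
            return  # prune: this branch cannot beat the best complete order
        if len(prefix) == len(actors):
            best["cost"], best["order"] = bound, list(prefix)
            return
        for a in actors:
            # an actor is ready when all of its predecessors are placed
            if a not in placed and preds.get(a, set()) <= placed:
                prefix.append(a)
                placed.add(a)
                extend(prefix, placed)
                prefix.pop()
                placed.remove(a)

    extend([], set())
    return best["order"], best["cost"]

# Toy stand-in cost: largest gap between consecutive actors' made-up weights.
weight = {"a": 0, "b": 4, "c": 1, "d": 5}

def toy_cost(prefix):
    return max((abs(weight[u] - weight[v])
                for u, v in zip(prefix, prefix[1:])), default=0)

order, cost = branch_and_bound(["a", "b", "c", "d"], {"b": {"a"}}, toy_cost)
print(order, cost)  # ['a', 'c', 'b', 'd'] 3
```

Because the bound is evaluated on every partial order, each node of the search tree costs one evaluation of `partial_cost`, which is the overhead noted above.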
VII. GENETIC ALGORITHM FOR TRANSACTION SCHEDULING
The branch and bound approach requires excessive amounts of time for graphs that have significant numbers of IPC edges. To develop an alternative to this branch and bound approach, and the TPO heuristic, we have implemented a genetic algorithm (GA) to search for the best transaction order. The GA exploits the increased tolerance for compile time that is available for many embedded applications [19], and can leverage the TPO heuristic by incorporating its solution in the "initial population."
In our GA formulation, candidate transaction orders are encoded using the matrix-based sequence-encoding method described in [20]. Using this method, the partial order of the communication actors is converted into a precedence matrix and randomly completed to yield a random transaction order that is valid. Mutation is carried out by swapping rows and columns, and recombination is performed using the intersection operator explained in [20]. The intersection operator takes subsequences that are common among the parents by taking the Boolean "and" of the two parent matrices to form the "offspring," and the undefined part is randomly completed.
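A minimal Python sketch of the precedence-matrix encoding and the intersection step is given below. It follows only the description in the preceding paragraph; details of the operators in [20] beyond that description are not reproduced. The function names are hypothetical, and actors are assumed to be numbered 0 to n-1.

```python
import random

def to_matrix(order):
    """Precedence matrix of a total order: m[i][j] is True when actor i
    comes before actor j."""
    pos = {a: k for k, a in enumerate(order)}
    n = len(order)
    return [[pos[i] < pos[j] for j in range(n)] for i in range(n)]

def intersect(m1, m2):
    """Boolean 'and' of two parent matrices: keeps only the precedence
    relations on which both parents agree."""
    n = len(m1)
    return [[m1[i][j] and m2[i][j] for j in range(n)] for i in range(n)]

def random_completion(partial, rng):
    """Randomly extend a partial precedence matrix to a valid total order
    by repeatedly picking any actor with no unplaced predecessor."""
    remaining = set(range(len(partial)))
    order = []
    while remaining:
        ready = [j for j in sorted(remaining)
                 if not any(partial[i][j] for i in remaining)]
        pick = rng.choice(ready)
        order.append(pick)
        remaining.remove(pick)
    return order

parent1 = [0, 1, 2, 3]
parent2 = [1, 0, 3, 2]
common = intersect(to_matrix(parent1), to_matrix(parent2))
child = random_completion(common, random.Random(1))
# Both parents place actors 0 and 1 before actors 2 and 3, so the child
# does too; the relative order inside each pair is chosen at random.
print(child)
```

The same `random_completion` routine can generate the initial population if it is applied to the precedence matrix of the transaction partial order instead of to an intersection.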
The mutation step takes time multiplied by the number of swaps carried out since each time we have to check whether the swap was valid by comparing it with the partial boolean matrix corresponding to the transaction partial order graph . The recombination step takes time, and the evaluation step takes time. The overall complexity of each iteration is also influenced by the population size and the overhead involved in generating random numbers.
VIII. DYNAMIC REORDERING
Once we obtain a transaction order (e.g., using the TPO heuristic or the GA approach defined in Section VII), it is possible to swap the positions of consecutive communication actors in the transaction order as long as the new positions do not violate the dependencies imposed by the transaction partial order. This method has the advantage that it cannot degrade the transaction order, since we can discard any solution that is worse. The concept is similar to dynamic variable reordering used in ordered binary decision diagrams (OBDDs) [21]. OBDDs are data structures used for the representation and manipulation of Boolean functions, and are often applied in VLSI CAD. The choice of the variable ordering largely determines the size of the OBDD; its size may vary from linear to exponential. Dynamic reordering is a technique by which neighboring variables are swapped to find a better variable order. Swapping of neighboring variables can be performed in linear time.
We have implemented an adaptation of the sifting algorithm introduced by Rudell [22] to ordered transaction scheduling, which we call dynamic transaction reordering (DTR). The sifting algorithm tries to find the best position for a variable. Each variable is exchanged with its successor variable until it is sifted to the bottom of the directed acyclic graph (DAG). Then the variable is exchanged with its predecessor variable until it is sifted to the top of the DAG. The best DAG size seen during the search is remembered, and the variable is restored to the corresponding position. We have observed that DTR consistently yields improvements in the iteration period, regardless of the method used to find the transaction order.
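One sifting pass over a transaction order can be sketched in Python as follows. This is a simplified illustration, not the DTR implementation evaluated in this paper. The name `sift_pass` is hypothetical, `cost` stands in for evaluating the MCM of the graph under a given transaction order, and the toy cost in the example is made up.

```python
def sift_pass(order, preds, cost):
    """order: a valid transaction order (list of actors).
    preds: {actor: set of actors that must precede it}.
    cost(order): quantity to minimize (a stand-in for the MCM).
    Each actor is moved by adjacent swaps toward the end and then toward
    the front, as far as the partial order allows, and is finally left
    at the best position seen. The result is never worse than the input."""
    order = list(order)
    best_cost = cost(order)
    for actor in list(order):
        best_order = list(order)
        trial = list(order)
        i = trial.index(actor)
        # sift toward the end of the transaction order
        while i + 1 < len(trial) and actor not in preds.get(trial[i + 1], set()):
            trial[i], trial[i + 1] = trial[i + 1], trial[i]
            i += 1
            c = cost(trial)
            if c < best_cost:
                best_cost, best_order = c, list(trial)
        # sift toward the front of the transaction order
        while i > 0 and trial[i - 1] not in preds.get(actor, set()):
            trial[i], trial[i - 1] = trial[i - 1], trial[i]
            i -= 1
            c = cost(trial)
            if c < best_cost:
                best_cost, best_order = c, list(trial)
        order = best_order
    return order, best_cost

# Toy stand-in cost: largest gap between consecutive actors' made-up weights.
weight = {"a": 0, "b": 4, "c": 1, "d": 5}

def toy_cost(seq):
    return max(abs(weight[u] - weight[v]) for u, v in zip(seq, seq[1:]))

start = ["a", "b", "c", "d"]                # toy cost 4
improved, value = sift_pass(start, {"b": {"a"}}, toy_cost)
print(improved, value)                      # ['a', 'c', 'b', 'd'] 3
```

Only the direct precedence between the two swapped actors needs to be checked, because two actors related through an intermediate actor can never be adjacent in a valid order.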
IX. SIMULATOR
We have developed a software simulator of the execution of self-timed iterative schedules. It was developed under UNIX using the LEDA C++ programming library [23]. The simulated system is a shared-memory architecture. In this architecture, synchronizations are performed by accessing the shared memory bus. Data tokens associated with the IPC constraints are transferred via the shared memory. When a processor tries to gain access to the bus for a synchronization or interprocessor data communication, the system permits access if the bus is not in use; otherwise, it denies access, and the processor waits for a specified back-off time and then tries again.
The synchronization cost for OT is much lower than the synchronization costs for ST. In the OT strategy, a shared bus access takes no more than a single read or write cycle on the processor, and the overall cost of communicating one data sample is two or three instruction cycles [12]. Our simulator for ST operation implements both the Unbounded Buffer Synchronization (UBS) and the Bounded Buffer Synchronization (BBS) protocols [10]. In the BBS protocol (used with feedback synchronization edges), one local memory increment operation (on the local write pointer) and one write to shared memory (to store the write pointer) must occur after the source node of the synchronization edge has executed. The initial value of the write pointer is set to the delay of the synchronization edge . The BBS synchronization before the execution of involves a comparison between the value stored in shared memory and a local value (the local read pointer). The initial value of the local read pointer is 0. This comparison is repeated until the read pointer and the saved write pointer are not equal, at which time the actor can execute.
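The pointer operations of the BBS protocol can be modeled with a small single-threaded Python sketch. The class and method names are hypothetical, pointer wrap-around is omitted, and the sketch assumes that the local read pointer is incremented after the sink actor executes, a step that the description above does not spell out.

```python
class BBSEdge:
    """Toy model of one feedback synchronization edge under BBS."""

    def __init__(self, delay):
        self.local_wr = delay    # local write pointer, starts at the edge delay
        self.shared_wr = delay   # copy of the write pointer in shared memory
        self.local_rd = 0        # local read pointer, starts at 0
        self.shared_accesses = 0 # number of shared-memory accesses so far

    def after_source_fires(self):
        self.local_wr += 1              # one local increment
        self.shared_wr = self.local_wr  # one write to shared memory
        self.shared_accesses += 1

    def sink_may_fire(self):
        self.shared_accesses += 1       # each poll reads shared memory once
        return self.shared_wr != self.local_rd

    def after_sink_fires(self):
        self.local_rd += 1              # assumed: local update only

edge = BBSEdge(delay=1)
first = edge.sink_may_fire()    # True: one initial token is available
edge.after_sink_fires()
second = edge.sink_may_fire()   # False: the sink must keep polling
edge.after_source_fires()
third = edge.sink_may_fire()    # True: the source has produced a token
print(first, second, third, edge.shared_accesses)  # True False True 4
```

Each successful synchronization in this model costs one shared write by the source and at least one shared read by the sink; every failed poll adds a further shared read.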
In the UBS protocol (used with feed-forward synchronization edges), the local read and write pointers are maintained and initialized in the same manner as in the BBS protocol. The value stored in shared memory is no longer a copy of the write pointer, but is rather a count (initialized to the edge delay) of the number of unread tokens. After executes, the shared count is repeatedly examined until it is found to be less than the feed-forward buffer size, at which point the IPC operation can proceed. The count is incremented and can execute. Before executes, the value of the shared count is repeatedly checked until it is nonzero. Then the read operation can proceed, and the count is decremented.
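The shared-count mechanism of the UBS protocol can be modeled in the same single-threaded style. The names below are hypothetical, and the sketch assumes that each test-and-update of the shared count is atomic (for example, performed while holding the bus), which a real implementation would have to guarantee.

```python
class UBSEdge:
    """Toy model of one feed-forward synchronization edge under UBS."""

    def __init__(self, delay, buffer_size):
        self.buffer_size = buffer_size
        self.count = delay        # shared count of unread tokens
        self.shared_accesses = 0  # number of shared-memory accesses so far

    def try_write(self):
        """Called after the source actor executes; False means poll again."""
        self.shared_accesses += 1          # read the shared count
        if self.count >= self.buffer_size:
            return False                   # buffer full
        self.count += 1                    # write the incremented count
        self.shared_accesses += 1
        return True

    def try_read(self):
        """Called before the sink actor executes; False means poll again."""
        self.shared_accesses += 1          # read the shared count
        if self.count == 0:
            return False                   # no token available yet
        self.count -= 1                    # write the decremented count
        self.shared_accesses += 1
        return True

edge = UBSEdge(delay=0, buffer_size=2)
results = [edge.try_read(),    # False: nothing to read yet
           edge.try_write(),   # True
           edge.try_write(),   # True: buffer is now full
           edge.try_write(),   # False: the writer must wait
           edge.try_read(),    # True: frees one slot
           edge.try_write()]   # True
print(results, edge.count, edge.shared_accesses)
```

In this model every successful operation on either side costs one shared read and one shared write, so UBS uses more shared-memory accesses per token than the BBS pointer scheme.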
We assume an architecture where all synchronization and memory accesses occur in a single shared memory. We define a parameter to be the ratio of the synchronization time to the IPC time. Since we are considering HSDF graphs with one data token produced per IPC operation, we have for BBS (at least 2 memory accesses for synchronization for every data memory access) and for UBS.
A. Task Execution Times
For many DSP applications, it is possible to obtain accurate statistics on task execution times. Probabilities for events such as cache misses, pipeline stalls, and conditional branches can be obtained by using sampling techniques or simulation of the target architecture [24] . We utilize the task execution model in [25] , where each task in the task graph is associated with three possible execution times , or with probabilities , and respectively. Here, is the task execution time given in the benchmark specification, and . We define a single parameter for the degree of randomness of the task execution times, where , and . Note that for this probability distribution, corresponds to the highest degree of randomness. Under these assumptions, we note that the FS strategy is not practical for any , no matter how small, since a FS architecture must operate with for all tasks in order to assure correctness.
X. RESULTS
Experiments were carried out to compare the ST method and the OT method, and to measure the performance of the TPO, GA, and DTR heuristics in finding transaction orders. The algorithms presented in Sections V-VIII were implemented in C/C++ using the LEDA [23] framework for fundamental graph theoretic data structures and algorithms. The benchmarks are standard DSP applications that have been scheduled using the DLS algorithm [26].
The IPC graphs are fairly complicated, ranging from 9 to 764 nodes, and the numbers of processors involved range from 3 to 8. The examples fft1, fft2, and fft3 result from three representative schedules for fast Fourier transforms based on examples given in [20]; karp10 is a music synthesis application based on the Karplus-Strong algorithm in 10 voices; the video coder is taken from [27]; cddat is a CD to digital audio tape converter taken from the Ptolemy suite [1]; laplace is a Laplace transform from [28]; and irr is an adaptation of a physics algorithm [29].
The iteration period for the ST schedule is obtained using the simulator described in Section IX, while the iteration period of the OT schedule is obtained from the MCM of . We used the task execution model from Section IX-A, and calculated the average iteration periods over 10 000 iterations.
We define a parameter that quantifies the IPC overhead in a given schedule. It is calculated from the ratio of the total IPC time (synchronization plus data communication) to the total execution time spent on computational tasks over all processors. Thus, is a function of , the schedule, and the relative speed of processor to memory. We note that VLSI is trending toward higher relative processor-to-memory speeds (higher ) as gate lengths decrease. Fig. 12 plots and versus the parameter that governs the degree of randomness of the task execution times, for different values of , the ratio of synchronization-to-IPC overhead described in Section IX. The calculations for do not correspond to any synchronization protocol in our architectural model, but are given as a point of reference. Values of would be possible if a separate (faster) memory were allocated to the synchronization variables. Fig. 13 plots these same measures for the fft1 and Laplace benchmarks.
The iteration period increases with since the average execution time increases with . For many DSP applications is a reasonable assumption. For example, if corresponds to a cache miss, to a processor pipeline stall, and to the base execution time for a task, corresponds to a 1% cache miss probability, a 9% pipeline stall probability, and a 90% probability for the base execution time.
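The effect of the three-point distribution on the average execution time can be illustrated with a short Python sketch. The probabilities (1%, 9%, 90%) are those of the example above; the slowdown factors for a cache miss and a pipeline stall (`miss_factor`, `stall_factor`) are hypothetical values chosen only for illustration and are not the values of the execution time model in [25].

```python
import random

def expected_time(base, p_miss, p_stall, miss_factor, stall_factor):
    """Mean execution time of a task under the three-point distribution."""
    p_base = 1.0 - p_miss - p_stall
    return base * (p_base + p_stall * stall_factor + p_miss * miss_factor)

def sample_time(base, p_miss, p_stall, miss_factor, stall_factor, rng):
    """Draw one execution time from the same distribution."""
    u = rng.random()
    if u < p_miss:
        return base * miss_factor      # cache miss
    if u < p_miss + p_stall:
        return base * stall_factor     # pipeline stall
    return base                        # base execution time

# Hypothetical factors: a miss triples and a stall doubles the base time.
mean = expected_time(10, 0.01, 0.09, miss_factor=3, stall_factor=2)
rng = random.Random(0)
draws = [sample_time(10, 0.01, 0.09, 3, 2, rng) for _ in range(1000)]
print(mean, sum(draws) / len(draws))
```

With these assumed factors the mean is 11% above the base time, and it grows as the miss and stall probabilities grow, which is the reason the iteration period rises with the degree of randomness.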
From Fig. 12 we see that increases more slowly as a function of than does . This is because the self-timed schedule is able to adapt to changes in execution times, while the OT schedule is fixed. This behavior was observed with all the benchmarks. We also see that it is possible for OT to outperform ST for , as explained in Section III, but only for small . Comparing Fig. 12(a) and (b) , we see that the slopes of the curves decrease as increases. This is because the IPC operations are not random, and so as IPC increases, a smaller fraction of the overall execution time comes from random tasks.
We also observe that the relative improvement of OT over ST increases as increases. Figs. 14 and 15 plot the ratio versus for the irr and fft1 benchmarks. For the irr benchmark, for all and . For the fft1 benchmark with and , and elsewhere. As discussed above, represents a high degree of uncertainty for task execution times in DSP applications (with representing the highest possible degree of randomness in the probability distribution).
Table I compares the performance (iteration period) of the ST and the OT schedules. In all cases, we observe that the OT strategy outperforms the ST strategy for . As noted before, and represent lower bounds for the BBS and UBS synchronization protocols, respectively.
Table II gives us a comparison between the different heuristics in finding transaction orders. Each entry is the iteration period when the transaction order found by the heuristic is enforced. Column 2 shows the iteration period when a randomly generated transaction order is enforced and indicates the iteration period when DTR is used to refine the transaction order obtained using TPO and the solution is incorporated into the initial population of the GA. From the table we can conclude that all the heuristics work fairly well compared to the random transaction order. The TPO results shown are for the modified version, which draws the unit-delay edge at the beginning. This consistently gives us a slight improvement. Generally, the TPO heuristic works better than the BFB technique, especially for cddat, and the heuristic that combines the TPO heuristic and DTR performs quite well (even better than the GA, which takes significantly more time to execute). The GA was implemented with a population size of 100, and the number of iterations was set to 1000. In our experiments, the GA generally stabilized before the 1000-iteration limit was reached.
When we use the transaction ordering obtained by the TPO heuristic combined with DTR in the initial population of the GA, we achieve the best results, since we simultaneously obtain the benefits of all three approaches. The branch and bound strategy shows us the lower bound that the OT method can achieve, but since the complexity of the BB method is exponential, we are unable to obtain results for the larger benchmarks. The results are shown in Table II.
XI. CONCLUSION
We have demonstrated that the ordered transaction method, which is superior to the self-timed method in its predictability and in its total elimination of synchronization overhead, can significantly outperform self-timed implementation, even though the ordered transaction implementation offers less run-time flexibility due to a fixed ordering of communication operations. When synchronization cost is taken into account, the ordered transaction method performs significantly better than the self-timed method. We have studied the relative behavior of OT and ST implementations under a realistic model for task execution times. The OT strategy performs better relative to the ST strategy for lower (degree of randomness in task execution times), higher (synchronization costs), and We have also shown that in the presence of nonzero IPC costs, finding an optimal transaction order is an NP-complete problem, and we have developed a variety of heuristic techniques to find efficient transaction orders. These techniques include a low-complexity, deterministic heuristic for rapid design space exploration, a dynamic reordering technique for improving a given transaction order, and a genetic algorithm for exploiting extra compile time when generating final implementations. We have also developed a detailed simulator to measure the performance of the self-timed schedule under different constraints.
The benefits of OT can be expected to increase with the general trend in VLSI technology toward increasing processor/memory performance disparity. Some of this benefit may be offset, however, by another trend: decreasing predictability in application behavior (and thus execution times) due to the use of increasingly sophisticated and adaptive algorithms. The evolution of the MPEG standards is an example of this. Research on OT methods that efficiently handle such lower degrees of predictability is therefore an interesting and important direction for further study.
Useful directions for further work include incorporating transaction ordering considerations into the scheduling process; integrating transaction ordering and retiming of synchronous dataflow graphs [30]; the exploration of hybrid scheduling strategies that can combine ordered transaction, self-timed, and fully-static strategies in the same implementation based on subsystem characteristics; and extension to more general DSP modeling techniques, such as stream-based functions [31], cyclo-static dataflow [32], multidimensional dataflow [33], and parameterized dataflow [34].
