This paper explores the problem of efficiently ordering interprocessor communication operations in statically-scheduled multiprocessors for iterative dataflow graphs. In most digital signal processing applications, the throughput of the system is significantly affected by communication costs. By explicitly modeling these costs within an effective graph-theoretic analysis framework, we show that ordered transaction schedules can significantly outperform self-timed schedules even when synchronization costs are low. However, we also show that when communication latencies are non-negligible, finding an optimal transaction order given a static schedule is an NP-complete problem, and that this intractability holds both under iterative and non-iterative execution. We develop new heuristics for finding efficient transaction orders, and perform an experimental comparison to gauge the performance of these heuristics.
Background
This paper explores the problem of efficiently ordering interprocessor communication (IPC) operations in statically-scheduled multiprocessors for iterative dataflow specifications. An iterative dataflow specification consists of a dataflow representation of the body of a loop that is to be iterated indefinitely. Dataflow programming in this form is used widely in the design and implementation of digital signal processing (DSP) systems.
In this paper, we assume that we are given a dataflow specification of an application, and an associated multiprocessor schedule (e.g., derived from scheduling techniques such as those presented in [6, 9, 18, 22] ). Our objective is to reduce the overall IPC cost of the multiprocessor implementation, and the associated performance degradation, since IPC operations result in significant execution time and power consumption penalties, and are difficult to optimize thoroughly during the scheduling stage. IPC is assumed to take place through shared memory, which could be global memory between all processors, or could be distributed between pairs of processors (e.g., hardware first-in-first-out queues or dual ported memory). Such simple communication mechanisms, as opposed to cross bars and elaborate interconnection networks, are common in embedded systems, due to their simplicity and low cost.
Scheduling dataflow graphs
Our study of multiprocessor implementation strategies in this paper is in the context of homogeneous synchronous dataflow (HSDF) specifications. In HSDF, an application is represented as a directed graph in which vertices (actors) represent computational tasks of arbitrary complexity; edges (arcs) specify data dependencies; and the number of data values (tokens) produced and consumed by each actor is fixed. An actor executes (fires) when it has enough tokens on its input arcs, and during execution, it produces tokens on its output arcs. HSDF imposes the restriction that on each invocation, each actor consumes exactly one token from each input arc, and produces one token on each output arc. HSDF and closely-related models are used extensively for multiprocessor implementation of embedded signal processing systems (e.g., see [6, 10, 11, 12] ). We refer to an HSDF representation of an application as an application graph.
For multiprocessor implementation of dataflow graphs, actors in the graph need to be scheduled. Scheduling can be divided into three steps [13]: assigning actors to processors (processor assignment), ordering the actors assigned to each processor (actor ordering), and determining when each actor should commence execution. Each of these tasks can be performed either at run-time or at compile time, which gives us different scheduling strategies. To reduce run-time overhead and improve predictability, it is often desirable in embedded applications to carry out as many of these steps as possible at compile time [13].
Typically, there is limited information available at compile time since the execution times of the actors are often estimated values. These may differ from the actual execution times because some actors display run-time variation in their execution times, for example, due to conditionals or data-dependent loops within them. However, in a number of important embedded domains, such as DSP, it is widely accepted that execution time estimates are reasonably accurate, and that good compile-time decisions can be based on them. In this paper, we focus on scheduling methods that make extensive use of execution time estimates, and perform the first two steps (processor assignment and actor ordering) at compile time.
In relation to the scheduling taxonomy of Lee and Ha [13], there are three general strategies with which we are primarily concerned in this paper. In the fully-static (FS) strategy, all three scheduling steps are carried out at compile time, including the determination of an exact firing time for each actor. In the self-timed (ST) strategy, on the other hand, processor assignment and actor ordering are performed at compile time, but run-time synchronization is used to determine actor firing times: an ST schedule executes by firing each actor invocation as soon as it can be determined via synchronization that the actor invocations on which it is dependent have all completed execution.
The FS and ST methods represent two extremes in the class of scheduling algorithms considered in this paper. The ST method is the least constrained scheme since the only constraints are the IPC dependencies, and it is tolerant of variations in execution times, while the FS strategy only works when tight worst case execution times are available, and forces system performance to conform to the available worst case bounds. When we ignore IPC costs, the ST schedule consequently gives us a lower bound on the average iteration period of the schedule since it executes in an ASAP (as soon as possible) manner.
The ordered transaction (OT) method [11, 23] falls in-between these two strategies. It is similar to the ST method but also adds the constraint that a linear ordering of the communication actors is determined at compile time, and enforced at run-time. The linear ordering imposed is called the transaction order of the associated multiprocessor implementation.
The FS and OT strategies have significantly lower overall IPC cost since all of the sequencing decisions associated with communication are made at compile time. The ST method, on the other hand, incurs more IPC cost since it requires synchronization checks to guarantee the fidelity of each communication operation, that is, to guarantee that buffer underflow and overflow are consistently avoided. Significant compile-time analysis can be performed to streamline this synchronization functionality [3, 4].
The metric of interest to us in this paper is the average iteration period $T$. Intuitively, in an iterative execution of a dataflow graph, the iteration period is the number of cycles that it takes for each of the actors in the graph to execute exactly once, i.e., to complete a single graph iteration. Note that it is not necessary in a self-timed schedule for the iteration period to be the same from one graph iteration to the next, even when actor execution times are fixed [24]. The inverse of the average iteration period $T$ gives us the throughput $T^{-1}$, which is the average number of graph iterations carried out per unit time.
Terminology and notation
We denote the set of positive integers by $\mathbb{Z}^{+}$, the set of natural numbers by $\mathbb{N}$, and the number of elements in a finite set $S$ by $|S|$.
With each actor $v$ in an HSDF specification $G = (V, E)$, we associate an integer $t(v)$, which denotes the execution time estimate of $v$, and an integer $\mathrm{proc}(v)$, which denotes the processor that is assigned to $v$ in the assignment step. Each edge $e \in E$ has a non-negative integer delay associated with it, which is denoted by $\mathrm{delay}(e)$. These delays represent initial tokens, and specify dependencies between iterations of actors in iterative execution. For example, if the tokens produced by an actor $x$ on its $k$th invocation are consumed by actor $y$ on its $(k+2)$th invocation, the edge between $x$ and $y$ would have a delay of 2.
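Every edge $e = (x, y)$ induces the precedence constraint

$$\mathrm{start}(y, k) \ge \mathrm{start}(x, k - \mathrm{delay}(e)) + t(x),$$

where $\mathrm{start}(v, k)$ denotes the starting time of the $k$th invocation of an actor $v$, and $t(x)$ is the execution time estimate of $x$. Here, the values of $\mathrm{start}(v, k)$ for $k < 0$ are fixed as initial conditions.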
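A path in a directed graph $(V, E)$ is a finite sequence $(e_1, e_2, \ldots, e_n)$, where each $e_i$ is in $E$, and $\mathrm{snk}(e_i) = \mathrm{src}(e_{i+1})$, for $i = 1, 2, \ldots, n-1$; here, $\mathrm{src}(e)$ and $\mathrm{snk}(e)$ denote the source and sink vertices of an edge $e$. We say that the path $p = (e_1, e_2, \ldots, e_n)$ is directed from $\mathrm{src}(e_1)$ to $\mathrm{snk}(e_n)$. A path that is directed from some vertex to itself is called a cycle. Given a path $p = (e_1, e_2, \ldots, e_n)$, the path delay of $p$, denoted $\mathrm{Delay}(p)$, is given by $\mathrm{Delay}(p) = \sum_{i=1}^{n} \mathrm{delay}(e_i)$.
Each cycle $C$ in a dataflow graph must satisfy $\mathrm{Delay}(C) > 0$ to avoid deadlock.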
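The deadlock condition can be checked mechanically. The following is a minimal sketch, assuming the graph is given as a list of (source, sink, delay) triples; the function name `is_deadlock_free` and this representation are illustrative choices, not notation from the paper. It relies on the observation that every cycle has positive path delay exactly when the delayless edges alone form an acyclic graph.

```python
from collections import defaultdict

def is_deadlock_free(edges):
    """edges: iterable of (src, snk, delay) triples with delay >= 0.

    Every cycle has positive total delay iff the zero-delay edges alone
    form an acyclic graph, which is checked here with Kahn's algorithm.
    """
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for src, snk, delay in edges:
        nodes.update((src, snk))
        if delay == 0:
            succ[src].append(snk)
            indeg[snk] += 1
    # Repeatedly remove vertices with no incoming delayless edge.
    ready = [v for v in nodes if indeg[v] == 0]
    visited = 0
    while ready:
        v = ready.pop()
        visited += 1
        for w in succ[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                ready.append(w)
    # If some vertex was never removed, a delayless cycle exists.
    return visited == len(nodes)
```

For instance, a two-actor loop with one unit-delay edge passes the check, while the same loop with both delays set to zero does not.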
The evolution of a self-timed implementation can be modeled by Sriram's IPC graph [24]. Given an application graph and an associated self-timed schedule, the IPC graph, denoted $G_{ipc}$, is constructed by instantiating a vertex for each application graph actor, connecting an edge from each actor to the actor that succeeds it on the same processor, and adding an edge that has unit delay from the last actor on each processor to the first actor on the same processor. Also, for each application graph edge $(x, y)$ that connects actors that execute on different processors, an inter-processor edge is instantiated in $G_{ipc}$ from $x$ to $y$. A sample application graph and a self-timed schedule are illustrated in Figure 1, and the corresponding IPC graph is illustrated in Figure 2. IPC costs are incorporated into the IPC graph by modeling communication operations as send and receive actors, and setting the execution times of these actors to equal the associated IPC costs.
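The IPC graph is an instance of Reiter's computation graph model [20], also known as the timed marked graph model in Petri net theory [19], and from the theory of such graphs, it is well known that in the ideal case of unlimited bus bandwidth, the average iteration period for the ASAP execution of an IPC graph is given by the maximum cycle mean (MCM) of $G_{ipc}$, which is

$$\mathrm{MCM}(G_{ipc}) = \max_{\text{cycle } C \text{ in } G_{ipc}} \left\{ \frac{\sum_{v \in C} t(v)}{\mathrm{Delay}(C)} \right\}, \tag{3}$$

where $t(v)$ is the execution time of actor $v$ and $\mathrm{Delay}(C)$ is the path delay of $C$. The quotient in (3) is referred to as the cycle mean of the associated cycle $C$.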
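To make the MCM definition concrete, the sketch below evaluates (3) by brute-force enumeration of simple cycles. This is only an illustration for small graphs, not the polynomial-time method one would use in practice; the function name `max_cycle_mean` and the dictionary and triple-based graph representation are assumptions made for this example.

```python
from fractions import Fraction

def max_cycle_mean(exec_time, edges):
    """exec_time: dict mapping actor -> integer execution time.
    edges: list of (src, snk, delay) triples.
    Assumes every cycle has total delay >= 1 (no deadlock).
    Returns the maximum cycle mean as a Fraction, or None if acyclic.
    """
    order = {v: i for i, v in enumerate(exec_time)}
    succ = {v: [] for v in exec_time}
    for src, snk, delay in edges:
        succ[src].append((snk, delay))
    best = None

    def dfs(start, v, time_sum, delay_sum, on_path):
        nonlocal best
        for w, d in succ[v]:
            if w == start:
                # Closed a simple cycle: evaluate its cycle mean.
                mean = Fraction(time_sum, delay_sum + d)
                if best is None or mean > best:
                    best = mean
            elif order[w] > order[start] and w not in on_path:
                # Only extend through higher-indexed vertices so that each
                # simple cycle is found once, from its lowest-indexed vertex.
                on_path.add(w)
                dfs(start, w, time_sum + exec_time[w], delay_sum + d, on_path)
                on_path.remove(w)

    for s in exec_time:
        dfs(s, s, exec_time[s], 0, {s})
    return best
```

For a two-actor loop with execution times 2 and 3 and a single unit delay, the only cycle has mean 5; doubling the delay on that loop halves the cycle mean.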
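A similar data structure that is useful in analyzing OT implementations is Sriram's ordered transaction graph model [24]. Given an ordering $O = (o_1, o_2, \ldots, o_p)$ for the communication actors in an IPC graph $G_{ipc} = (V, E_{ipc})$, the corresponding ordered transaction graph $G_{OT}(G_{ipc}, O)$ is defined as the directed graph $(V, E_{OT})$, where $E_{OT} = E_{ipc} \cup \{(o_1, o_2), (o_2, o_3), \ldots, (o_{p-1}, o_p), (o_p, o_1)\}$, with $\mathrm{delay}((o_i, o_{i+1})) = 0$ for $1 \le i < p$, and $\mathrm{delay}((o_p, o_1)) = 1$. Thus, an IPC graph can be modified by adding edges obtained from the ordering $O$ to create the ordered transaction graph.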
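As a small illustration of this construction, the sketch below extends an IPC edge list with the edges induced by a transaction order. The name `ordered_transaction_edges` and the (source, sink, delay) triple format are assumptions made for this example.

```python
def ordered_transaction_edges(ipc_edges, order):
    """ipc_edges: list of (src, snk, delay) triples of the IPC graph.
    order: sequence of communication actors, first to last.
    Returns the edge list of the ordered transaction graph.
    """
    ot_edges = list(ipc_edges)
    # Delayless edge between each pair of consecutive communication actors.
    for a, b in zip(order, order[1:]):
        ot_edges.append((a, b, 0))
    # Unit-delay edge from the last actor back to the first, which
    # represents the transition to the next graph iteration.
    if order:
        ot_edges.append((order[-1], order[0], 1))
    return ot_edges
```

The resulting edge list can be passed to any MCM routine to evaluate the iteration period of the corresponding OT implementation.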
Previous work
In [23, 24], Sriram and Lee discuss some of the advantages and disadvantages of the OT strategy compared to the ST strategy, in particular, lower synchronization and arbitration costs for the IPC mechanism at the expense of some run-time flexibility. They also develop a method to compute an optimum transaction order when a fully-static schedule is given beforehand. In this approach, a set of inequalities is constructed using the timing information of the given FS schedule, and represented as a graph. The Bellman-Ford shortest path algorithm is applied to this graph to obtain new starting times of the actors, thereby modifying the original FS schedule. A transaction order is then obtained by sorting the starting times of the communication actors. We shall term this method of finding the transaction orders, which is an efficient polynomial-time algorithm, the Bellman-Ford Based (BFB) method. Under an assumption that the cost (latency) of IPC is zero, Sriram shows that the transaction order determined by the BFB technique is always optimum.
However, in this paper, we show that when IPC costs are not negligible, as is frequently and increasingly the case in practice, the problem of determining an optimal transaction order is NP-hard. Thus, under nonzero IPC costs, we must resort to heuristics for efficient solutions. Furthermore, the polynomial-time BFB algorithm is no longer optimal, and alternative techniques that account for IPC costs are preferable.
Numerous approaches have been proposed for incorporating IPC costs into the assignment and ordering steps of scheduling (e.g., [2, 22] ). The techniques that we propose in this paper are complementary to these approaches in that they provide a means for mapping the resulting schedules into efficient OT implementations, which eliminate the performance and power consumption overhead associated with run-time synchronization and contention resolution.
Comparison of self-timed and ordered transaction strategies
Given an application graph, an associated multiprocessor schedule, and an FS implementation, an OT implementation, and an ST implementation for the schedule, suppose $T_{FS}$, $T_{OT}$, and $T_{ST}$, respectively, denote the average iteration periods of the corresponding schedules. In general, when IPC costs are negligible, $T_{ST} \le T_{OT} \le T_{FS}$ [24]. This is because the ST method has the least constraints. The ST schedule only has assignment and ordering constraints, while the OT schedule has transaction ordering constraints in addition to those constraints, and the FS schedule has exact timing constraints that subsume the constraints in the ST and OT schedules. ST schedules overlap in a natural manner, and eventually settle into a periodic pattern of iterations. This pattern can be exponential in size, and therefore, the ST schedule has the advantage that in successive iterations, the transaction order may be different, while this flexibility is not available for the OT and FS schedules.
In practical cases, however, the IPC cost is non-zero. Depending on the bandwidth of the bus, IPC costs may be quite significant. The throughput of the ST schedule can be computed easily when IPC costs are ignored by calculating the MCM of the corresponding dataflow graph (i.e., via (3)). However, when IPC costs are taken into account, this can no longer be done since bus contention comes into the picture. Not only do the communication actors in the dataflow graph have to wait for sufficient tokens on the input arcs to fire, they also have to wait for the bus to be available, i.e., no other communication actor should be accessing the bus at the same instant of time. Therefore, the throughput of the self-timed schedule is typically derived using simulation techniques, which are time-consuming. On the other hand, the throughput of the OT schedule can still be obtained by calculating the MCM of the ordered transaction graph since there will be no bus contention when a linear order is imposed on the communication actors [23].
The relation $T_{ST} \le T_{OT}$ is also no longer valid in the presence of non-zero IPC costs. In our example, when a suitable transaction order is enforced (Figure 4), an 11% lower average iteration period results compared with self-timed execution. This is because the transaction order is computed in a fashion that enables the heavily loaded Processor 1 to access the bus whenever it requires it. Such an ability to prioritize strategically-selected transactions is especially important in heterogeneous multiprocessors, which often have imbalanced loads due to large variations in processing capabilities of the computing resources.
The ST approach has the further disadvantage that in the presence of execution time uncertainties, there is no known method for computing a tight worst-case iteration period, even using simulation techniques. In particular, the period of the ST schedule obtained by using worst-case actor execution times is not necessarily the worst-case iteration period. In contrast, the iteration period obtained by computing the MCM of the ordered transaction graph with worst-case actor execution times is the worst-case iteration period. This is because the MCM is an accurate measure of performance for ordered transaction implementations [23, 24], and the MCM can only increase or remain the same when the execution time of an actor is increased.
Finding optimal transaction orders
In the transaction ordering problem, our objective is to determine a transaction order for a given IPC graph such that the MCM of the resulting ordered transaction graph is minimized (so that throughput is maximized). As mentioned in Section 2, it has been shown that this problem is tractable when IPC costs are ignored. In this section, we show that when IPC costs are considered, the transaction ordering problem becomes NP-complete.
We show this by first showing that determining an optimal transaction order for non-iterative implementations, which is a more restricted (easier) problem, is NP-complete. To convert an iterative IPC graph to a non-iterative one, it suffices to remove all edges in the graph that have delays of one or more. This results in an acyclic graph since any cycle in the original graph must have a delay of one or more for the graph not to be deadlocked. We refer to the resulting graph as the non-iterative IPC (NIPC) graph, and to the graph obtained by adding the delayless transaction order edges to it as the non-iterative ordered transaction (NOT) graph.
By definition, the total execution time (makespan) of a NOT graph is finite, and this execution time can be determined in polynomial time as the length of the longest cumulative-execution-time path in the NOT graph, since the graph is acyclic and the execution times of all actors are nonnegative. However, given an IPC graph, finding a transaction order that minimizes the makespan of the associated NOT graph is intractable.
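The polynomial-time makespan computation mentioned above can be sketched as a longest-path pass over a topological order. The function name `makespan` and the input format (a dictionary of execution times and a list of delayless edges) are assumptions made for this illustration; the sketch uses `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def makespan(exec_time, edges):
    """exec_time: dict mapping actor -> nonnegative execution time.
    edges: list of (src, snk) pairs of an acyclic graph.
    Returns the length of the longest cumulative-execution-time path.
    """
    preds = {v: set() for v in exec_time}
    for src, snk in edges:
        preds[snk].add(src)
    finish = {}
    # static_order() yields each actor after all of its predecessors.
    for v in TopologicalSorter(preds).static_order():
        start = max((finish[u] for u in preds[v]), default=0)
        finish[v] = start + exec_time[v]
    return max(finish.values(), default=0)
```

`TopologicalSorter` raises `graphlib.CycleError` if the input contains a cycle, which would indicate that a delay edge was not removed.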
Definition 3: The non-iterative transaction ordering problem is defined as follows. Given an NIPC graph $G_{nipc}$, and a positive integer $k$, does there exist a transaction order $O$ such that the NOT graph obtained from $G_{nipc}$ and $O$ has a makespan that is less than or equal to $k$?
To show that non-iterative transaction ordering is NP-hard, we derive a reduction from the sequencing with release times and deadlines (SRTD) problem, which is known to be NP-complete [8]. The SRTD problem is defined as follows.
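Definition 4: (The SRTD problem). Given an instance set $\mathcal{T}$ of tasks, and for each task $\tau \in \mathcal{T}$, a length (duration) $l(\tau)$, a release time $r(\tau)$, and a deadline $d(\tau)$, is there a single-processor schedule for $\mathcal{T}$ that satisfies the release time constraints and meets all the deadlines? That is, is there a one-to-one function (called a valid SRTD schedule) $\sigma : \mathcal{T} \to \{0, 1, 2, \ldots\}$, with $\sigma(\tau) > \sigma(\tau')$ implying $\sigma(\tau) \ge \sigma(\tau') + l(\tau')$, and for all $\tau \in \mathcal{T}$, $\sigma(\tau) \ge r(\tau)$, and $\sigma(\tau) + l(\tau) \le d(\tau)$?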
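The conditions of Definition 4 can be checked directly for a candidate schedule. The following is a minimal sketch; the function name `is_valid_srtd_schedule` and the dictionary-based task format are assumptions made for this illustration, and the task values used in the usage note are hypothetical.

```python
def is_valid_srtd_schedule(tasks, sigma):
    """tasks: dict mapping task name -> (length, release, deadline),
    with positive integer lengths.
    sigma: dict mapping task name -> proposed start time.
    Returns True iff sigma is a valid single-processor SRTD schedule.
    """
    # Release time and deadline constraints for every task.
    for name, (length, release, deadline) in tasks.items():
        start = sigma[name]
        if start < release or start + length > deadline:
            return False
    # Non-overlap: after sorting by start time, it suffices to compare
    # each task with its immediate successor.
    ordered = sorted(tasks, key=lambda n: sigma[n])
    for first, second in zip(ordered, ordered[1:]):
        if sigma[first] + tasks[first][0] > sigma[second]:
            return False
    return True
```

For a hypothetical instance with two tasks of lengths 2 and 3, release times 0 and 1, and deadlines 3 and 6, starting the tasks at times 0 and 2 is valid, whereas starting them at times 0 and 1 makes them overlap.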
Theorem 1: The non-iterative transaction ordering problem is NP-complete.
Proof: This problem is clearly in NP since we can verify in polynomial time whether the longest path length (in terms of cumulative execution time) of the graph is less than or equal to a given positive integer.
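Now suppose that we are given an instance of the SRTD problem with task set $\mathcal{T} = \{\tau_1, \tau_2, \ldots, \tau_n\}$. We construct an NIPC graph from this instance by carrying out the following steps. Here, all edges instantiated are delayless unless otherwise specified, and $D$ is equal to the maximum deadline of the tasks in the given instance of the SRTD problem.
For each $\tau_i \in \mathcal{T}$,
i) instantiate a send actor $c_i$ when $i$ is odd, or a receive actor $c_i$ when $i$ is even, with $t(c_i) = l(\tau_i)$ and $\mathrm{proc}(c_i) = i$.
ii) instantiate a computation actor $a_i$ with $t(a_i) = r(\tau_i)$ and $\mathrm{proc}(a_i) = i$.
iii) instantiate a computation actor $b_i$ with $t(b_i) = D - d(\tau_i)$ and $\mathrm{proc}(b_i) = i$.
iv) instantiate an edge $(a_i, c_i)$ and another edge $(c_i, b_i)$.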
Each send actor is connected to a receive actor by an interprocessor edge with a delay of unity. Since each of the interprocessor edges has a delay of unity, these edges are not present in the NIPC graph. Without loss of generality, we assume that there are an even number of tasks, so that the number of send and receive actors is the same (if the number of tasks is not even to begin with, we can instantiate an appropriately-defined dummy actor to generate an equivalent "even-task" instance). We claim that the given SRTD instance has a valid SRTD schedule if and only if the constructed NIPC graph has a transaction order whose makespan is less than or equal to $D$, where $D$ denotes the maximum deadline of the tasks.

The reasoning behind our construction and the above claim is that we make the communication actors of the ordered transaction graph correspond exactly to the tasks of the SRTD problem. We do this by making the execution time of the computation actor before each communication actor equal to the release time of the associated task, which guarantees that the communication actors cannot begin execution before their respective release times. Also, since these computation actors begin execution at time 0, as each is on a different processor, the release times correspond to when they complete execution. Similarly, the execution time of the computation actor that follows each communication actor is chosen to be $D - d$, where $d$ is the deadline of the associated task, so that the communication actor has to complete its execution by time $d$ for the makespan to be less than or equal to $D$. This is true because the computation actor can begin execution immediately after the communication actor has finished.

Therefore, the valid SRTD schedule corresponds exactly to the shared bus schedule in the derived instance of the non-iterative transaction ordering problem. If we can find a transaction order that has a makespan less than or equal to $D$, we have a bus schedule that schedules the communication actors in the same manner as an appropriate single-processor schedule for the corresponding SRTD tasks.
Conversely, if a transaction order cannot be found that satisfies the given makespan constraint, it is easily seen that there is no valid SRTD schedule for the given instance of the SRTD problem. Q.E.D.
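The correspondence used in the proof can be explored on tiny instances with the following sketch. It assumes the reading of the construction in which each task contributes a three-actor chain on its own processor: a computation actor whose execution time is the release time, a communication actor whose execution time is the task length, and a computation actor whose execution time is the maximum deadline minus the task deadline. The function names and the brute-force search over all orders are illustrative assumptions and are only practical for a handful of tasks.

```python
from itertools import permutations

def not_makespan(tasks, order):
    """tasks: dict mapping name -> (length, release, deadline).
    order: sequence of task names giving the transaction (bus) order.
    Returns the makespan of the non-iterative ordered transaction graph
    under the assumed three-actor-per-task construction.
    """
    max_deadline = max(d for _, _, d in tasks.values())
    bus_free = 0  # time at which the previous communication actor ends
    span = 0
    for name in order:
        length, release, deadline = tasks[name]
        # The communication actor waits for its preceding computation
        # actor (which ends at the release time) and for the bus.
        start = max(release, bus_free)
        bus_free = start + length
        # The trailing computation actor adds max_deadline - deadline.
        span = max(span, bus_free + (max_deadline - deadline))
    return span

def has_feasible_order(tasks):
    """True iff some transaction order keeps the makespan within the
    maximum deadline, i.e. the SRTD instance is schedulable."""
    bound = max(d for _, _, d in tasks.values())
    return any(not_makespan(tasks, order) <= bound
               for order in permutations(tasks))
```

With the hypothetical tasks `{"t1": (2, 0, 3), "t2": (3, 1, 6)}`, the order (t1, t2) gives a makespan of 5, which is within the bound of 6, whereas the order (t2, t1) gives a makespan of 9.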
Note that in Theorem 1, we have simplified the problem greatly by assuming the inter-processor edges to have unit delays. This removes the inter-dependencies that are imposed by these edges, but even with this simplification, the problem remains NP-complete.
Example 3: Suppose that we are given an instance of the SRTD problem with four tasks, together with their respective release times, lengths, and deadlines. To construct an instance of the non-iterative transaction ordering problem with a makespan bound of 11, we create 4 processors, each with 3 vertices. The execution times are determined from the construction above. The resulting NOT graph is illustrated in Figure 8. Dash-dot edges indicate OT edges. Removing the dash-dot edges that represent the transaction order edges gives us the NIPC graph constructed from above.
This figure shows a transaction order where the schedule length of 11 is satisfied. This means that there exists a valid SRTD schedule for the given SRTD problem instance. The start times of the tasks can be obtained by finding the longest path lengths between the source nodes and the corresponding communication actors. Setting the starting times of the tasks to equal these longest path lengths, we obtain a valid SRTD schedule for the SRTD problem instance.
As demonstrated by Theorem 2 below, we can extend the proof of Theorem 1 to show that the transaction ordering problem is NP-complete in the iterative context as well as the non-iterative case.
Definition 5: The iterative transaction ordering problem (also called the transaction ordering problem) is defined as follows. Given an IPC graph $G_{ipc}$ and a positive integer $k$, does there exist a transaction order $O$ such that the ordered transaction graph $G_{OT}(G_{ipc}, O)$ satisfies $\mathrm{MCM}(G_{OT}(G_{ipc}, O)) \le k$?
Theorem 2: The iterative transaction ordering problem is NP-complete.
Proof: The MCM can be found in polynomial time; therefore, the problem is in NP.
To establish NP-hardness, we again derive a reduction from the SRTD problem, and we modify the graph construction from the proof of Theorem 1 so that the MCM equals the makespan. Now suppose we are given an instance of the SRTD problem with a set of tasks. We construct an IPC graph from this instance by carrying out the following steps. All edges instantiated are delayless unless otherwise specified, and $D$ is equal to the maximum deadline of the tasks in the given instance of the SRTD problem.
v) instantiate a send actor with and .
vi) instantiate a receive actor with and . an edge with a delay of unity from the last actor on the processor to the first actor.
We again assume without loss of generality that there is an even number of tasks in the given SRTD instance.
Each send actor is connected to the receive actor with an interprocessor edge of unit delay. Note that in the OT graph, these interprocessor edges become redundant (in the sense of synchronization redundancy, as discussed in [3]) because of the ordered transaction edges added due to the transaction order: since the ordered transaction edges are connected by a cycle of delay unity, the constraints imposed by the interprocessor edges are automatically met by the ordered transaction edges.
This graph effectively represents a blocked schedule for an iterative graph when the actors that have been instantiated after step v) have execution times that are much less than the execution times of the other actors, and the MCM of the constructed graph represents the longest path or the schedule length of the graph. Note that each of the longest paths in the non-iterative graph will correspond to a cycle in the iterative case, where the cycle mean of the cycle is equal to the longest path (since the denominator of the associated quotient in (3) is unity).
As in the non-iterative case, it is possible to find a one-processor schedule of the SRTD instance that satisfies the constraints if we can determine a transaction order whose enforcement will guarantee that the MCM of the corresponding OT graph is less than or equal to the given bound. This is true because the communication actors that have non-zero IPC cost in the bus schedule of the OT problem correspond to the tasks in the valid schedule of the SRTD problem.
Hence, we can conclude that the (iterative) transaction ordering problem is NP-complete. Q.E.D.
Example 4: Consider again the SRTD instance of Example 3. Figure 9 shows the corresponding ordered transaction graph that results when the transaction ordering is imposed. Removing the OT edges gives the constructed IPC graph. Note that the interprocessor edges introduced during construction are redundant in the OT graph due to the paths that are imposed by the linear order and have delays of one or less.
The transaction partial order heuristic
The BFB technique does not take bus contention into consideration while scheduling the transaction order. Instead, it tries to find a transaction order that will be close to or equal to the associated self-timed schedule. However, we have demonstrated that in the presence of non-zero IPC, the OT method can, in fact, perform significantly better than the ST method, and thus, more efficient transaction orders can be obtained by taking bus contention into account. This is the motivation for the transaction partial order (TPO) heuristic.
The TPO heuristic makes sense intuitively since the dependencies imposed by the edges drawn from the candidate vertices will remain when the transaction ordering is enforced. These edges represent constraints in addition to the interprocessor edges that are already present in the IPC graph, and thus, they can only increase the MCM or leave the MCM unchanged.
Example 6: When we apply the TPO heuristic to the IPC graph of Figure 2, the schedule we obtain is illustrated by the Gantt chart of Figure 4. The corresponding OT graph is illustrated in Figure 14.
The OT edges corresponding to the actors that have already been scheduled are added as the heuristic proceeds since they represent the schedule of the bus, and hence, make the heuristic more accurate for the later stages of the transaction order. The maximum number of nodes in the ready list at any given instant is (where is the number of processors). The complexity of the algorithm is thus since the complexity of computing the MCM of a graph is .
The edge of the transaction order that connects the last communication actor in the ordering with the first one has a delay of unity (to represent the transition to the next graph iteration).
We can improve the performance of the TPO algorithm by introducing this edge at the beginning because it will give a more accurate estimate of the MCM in choosing vertices later as the heuristic proceeds. Under this modification, the heuristic proceeds as before, except that the "last" (unit-delay) transaction ordering edge is drawn at the beginning. Since has a maximum of
Figure 13. Pseudocode for TPO heuristic.
Genetic algorithm for transaction scheduling
Since the transaction ordering problem is intractable, we are unable to efficiently find optimal transaction orders on a consistent basis. We have implemented a branch and bound strategy to explore the search space comprehensively, but this technique requires excessive amounts of time for graphs that have significant numbers of IPC edges. To develop an alternative to this branch and bound approach, and the TPO heuristic, we have implemented a genetic algorithm (GA) to search for the best transaction order. The GA exploits the increased tolerance for compile time that is available for many embedded applications [14] , and can leverage the TPO heuristic by incorporating its solution in the "initial population."
In our GA formulation, candidate transaction orders are encoded using the matrix-based sequence-encoding method described in [7] . Using this method, the partial order of the communication actors is converted into a precedence matrix and randomly completed to yield a random transaction order that is valid. Mutation is carried out by swapping rows and columns, and recombination is performed using the intersection operator explained in [7] . The intersection operator takes subsequences that are common among the parents by taking the boolean "and" of the two parent matrices to form the "offspring," and the undefined part is randomly completed.
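The intersection-based recombination described above can be sketched as follows. This is a simplified illustration, not the exact encoding of [7]: the precedence matrix is represented as a set of (earlier, later) pairs, the boolean "and" of two parent matrices becomes a set intersection, and the undefined part is completed by a random topological sort. The function names are assumptions made for this example.

```python
import random

def precedence_pairs(order):
    """All (earlier, later) pairs implied by a linear order."""
    return {(a, b) for i, a in enumerate(order) for b in order[i + 1:]}

def intersect_and_complete(parent1, parent2, rng=random):
    """Keep the precedences common to both parents (boolean 'and' of
    their precedence matrices) and randomly complete the rest."""
    common = precedence_pairs(parent1) & precedence_pairs(parent2)
    remaining = list(parent1)
    child = []
    while remaining:
        # Actors with no unplaced predecessor under the common relation.
        ready = [v for v in remaining
                 if not any((u, v) in common for u in remaining)]
        pick = rng.choice(ready)
        child.append(pick)
        remaining.remove(pick)
    return child
```

Because the common precedences of two linear orders always form a partial order, the ready list is never empty, and every offspring respects the orderings on which both parents agree. In an actual transaction ordering setting, the precedences of the transaction partial order would be included in every parent, so they are preserved automatically.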
A pseudocode sketch of the GA is shown in Figure 15 . For details on the underlying GA concepts (e.g., tournament selection), we refer the reader to 
Dynamic reordering
Once we obtain a transaction order (e.g., using the TPO heuristic or the GA approach defined in Section 6), it is possible to swap the position of consecutive communication actors in the transaction order as long as the new positions do not violate the dependencies imposed by the transaction partial order. This method has the advantage that it cannot degrade the transaction order since we can discard any solution that is worse. The concept is similar to dynamic variable reordering used in OBDD's (Ordered Binary Decision Diagrams) [17] . We have implemented an adaptation to ordered transaction scheduling, called dynamic transaction reordering (DTR), of the Sifting Algorithm introduced by Rudell [21] , and have observed that from DTR, we consistently obtain improvements in the iteration period, regardless of the method used to find the transaction order.
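The basic swap-and-keep-if-better idea can be sketched as below. This is a simplified adjacent-swap hill climb rather than the full sifting procedure of [21], and the names `adjacent_swap_improve`, `cost`, and `must_precede` are assumptions made for this illustration. In the setting of this paper, `cost` would evaluate the MCM of the ordered transaction graph for a given order, and `must_precede` would hold the pairs fixed by the transaction partial order.

```python
def adjacent_swap_improve(order, cost, must_precede):
    """order: initial transaction order (list of actors).
    cost: callable mapping an order to a number (lower is better).
    must_precede: set of (a, b) pairs where a has to stay before b.
    Returns the improved order and its cost.
    """
    best = list(order)
    best_cost = cost(best)
    improved = True
    while improved:
        improved = False
        for i in range(len(best) - 1):
            a, b = best[i], best[i + 1]
            if (a, b) in must_precede:
                continue  # swapping would violate a dependency
            trial = best[:i] + [b, a] + best[i + 2:]
            trial_cost = cost(trial)
            # Keep the swap only if it strictly improves the cost, so
            # the result is never worse than the starting order.
            if trial_cost < best_cost:
                best, best_cost = trial, trial_cost
                improved = True
    return best, best_cost
```

Since an adjacent swap reverses the relative order of exactly one pair of actors, checking that single pair against `must_precede` is sufficient to keep the order valid.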
Results
Experiments were carried out to compare the ST method and the OT method, and to measure the performance of the TPO, GA, and DTR heuristics in finding transaction orders. These heuristics were implemented in C/C++ using the LEDA [16] framework for fundamental graph-theoretic data structures and algorithms. The benchmarks are standard DSP applications that have been scheduled using the classic HLFET algorithm [8] with straightforward extensions to incorporate IPC costs.
The IPC graphs are fairly complicated, ranging from 50 to 150 nodes, and the numbers of processors involved range from 2 to 8. The examples fft1, fft2, and fft3 result from three representative schedules for Fast Fourier Transforms based on examples given in [15]; karp10 is a music synthesis application based on the Karplus-Strong algorithm in 10 voices; and qmf4 is a 4-channel multi-resolution QMF filter bank for signal compression.
In the simulation of the ST schedule, we ignore the overhead of synchronization so as to give us a worst-case comparison with the OT schedule. In practice, of course, synchronization has nonzero cost, and thus, depending on the actual synchronization overhead in the target architecture, the benefit of the OT schedules examined will be even more than what the results here demonstrate. Thus, our analysis in this section gives a lower bound on the improvement we can expect using the OT implementation strategy in conjunction with our proposed transaction ordering techniques.
Table 1 compares the performance (iteration period) of the ST and the OT schedules.
Here, the average iteration period ($T_{OT}$) of the OT schedule is obtained by taking the best performance using the algorithms proposed in Sections 5-7, and $T_{ST}$ denotes the average iteration period of the corresponding ST schedule. In each of the cases, we see that the OT strategy can outperform the ST strategy, and that this holds even though we are ignoring synchronization costs, which gives us a very optimistic view of the performance under ST execution.
Table 2 gives us a comparison between the different heuristics in finding transaction orders. Each entry is the iteration period when the transaction order found by the heuristic is enforced. Column 2 shows the iteration period when a randomly-generated transaction order is enforced. From the table, we can conclude that all the heuristics work fairly well compared to the random transaction order. The TPO heuristic for which the results are shown is the modified version that inserts the unit-delay edge beforehand. This consistently gives us a slight improvement.
Generally, the TPO heuristic works better than the BFB technique, especially for fft1 and fft3, and the heuristic that combines the TPO heuristic and DTR performs best (even better than the GA, which takes significantly more time to execute). The GA was implemented with a population size of 100 and the number of iterations was set to 1000. In the experiments that we tried, the GA generally stabilized before the 1000-iteration limit was reached.
When we use the transaction ordering obtained by the TPO heuristic combined with DTR in the initial population of the GA, we achieve the best results since we simultaneously obtain the benefits of all three approaches. The results are shown in Table 3 . 
Conclusions
We have demonstrated that in the presence of accurate estimates for actor execution times, the ordered transaction method (which is superior to the self-timed method in its predictability, and its total elimination of synchronization overhead) can significantly outperform self-timed implementation, even though ordered transaction implementation offers less run-time flexibility due to a fixed ordering of communication operations. We have also shown that in the presence of non-zero IPC costs, finding an optimal transaction order is an NP-complete problem, and we have developed a variety of heuristic techniques to find efficient transaction orders. These techniques include a low-complexity, deterministic heuristic for rapid design space exploration, and a genetic algorithm for exploiting extra compile time when generating final implementations. Useful directions for further work include integrating transaction ordering considerations into the scheduling process, and exploring hybrid scheduling strategies that can combine ordered transaction, self-timed, and fully-static strategies in the same implementation based on subsystem characteristics.
References
[1] T. Back, U. Hammel, and H-P Schwefel, "Evolutionary computation: Comments on the history 
