In hardware design, it is necessary to simulate the anticipated behavior of the integrated circuit before it is actually cast in silicon. As simulation procedures are long due to the great number of tests to be performed, optimization of the simulation code is of prime importance. This paper describes two mathematical models for the minimization of the memory access times for a cycle-based simulator.
Introduction
Simulation is a crucial challenge for the design of integrated circuits [14]. In fact, the task involves iterating a design and simulation process before tests on real chips are possible. In a few words, a simulator can be viewed as a computer program that reads as input the physical description of an integrated circuit (a VHDL file, for example) and produces, in a so-called compilation phase, an executable simulation code that simulates the behavior of the circuit. Then, the test phase consists in running the executable code on a large number of benchmarks. Each benchmark consists of input data and output data: the executable code is given the input data of the benchmark and produces its own output data, which are compared to the theoretical output of the benchmark. If the produced and theoretical output data differ, the circuit has produced wrong data and is therefore not correct: the test fails. As the executable code that simulates the circuit is run a very large number of times (some test campaigns may last several days), improving the compilation phase so that the produced code runs faster is of practical interest, since it significantly reduces the length of the test campaign. In this paper, we propose two theoretical graph problems to model the optimization of the code production.
In this paper, an integrated circuit will be seen as a set of logical gates (such as AND, OR, NOT, ...) interconnected through wires (see Figure 1(a)). The simulator assigns a binary variable v_i to every wire to store its signal value. Since the value of the output of a gate is a direct function of its input, the evaluation order of these variables must follow a directed acyclic precedence graph (see Figure 1(b)). The role of the code simulating the circuit is to sequentially compute the values of all the wires in order to compute the output. Each line of the code corresponds to the declaration of a variable v_i and the computation of the associated wire value from the input wire variables. The notions of line of code, wire, signal, and program variable are therefore strictly identical for the purpose of this paper. Clearly, since all the inputs of a gate must be computed in order to compute the gate output, the values of the wires must be computed in a topological order induced by the digraph (see Figure 1(c)). Conversely, any topological order of the digraph yields a different code.
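The correspondence between topological orders and simulation codes can be sketched as follows. This is only an illustration: the `netlist` dictionary, the gate names and the `generate_code` helper are hypothetical and are not taken from the simulator discussed here. The order is produced with the standard-library `graphlib.TopologicalSorter`.

```python
from graphlib import TopologicalSorter

# Hypothetical netlist: wire -> (gate type, input wires).
# Primary inputs are modeled as gates without input wires.
netlist = {
    "v1": ("INPUT", []),
    "v2": ("INPUT", []),
    "v3": ("AND", ["v1", "v2"]),
    "v4": ("NOT", ["v3"]),
    "v5": ("OR", ["v1", "v4"]),
}


def generate_code(netlist):
    """Return one line of simulation code per wire, in a topological order."""
    # TopologicalSorter takes a mapping node -> predecessors.
    sorter = TopologicalSorter({wire: ins for wire, (_, ins) in netlist.items()})
    lines = []
    for wire in sorter.static_order():
        gate, inputs = netlist[wire]
        lines.append(f"{wire} = {gate}({', '.join(inputs)})")
    return lines


for line in generate_code(netlist):
    print(line)
```

Any other topological order of the same netlist would give another valid listing; the models below aim at comparing such listings.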
Our problem is to find a topological order that produces the fastest code. The main difficulty in building the model is to find an estimate for the speed of the code. The computation time of the values of the gates is constant and independent of the evaluation order of the variables. Now, the value of a variable is stored in the cache memory when it is created and loaded every time it is used to evaluate one of its successors. There are several policies for cache management (see for example [15]). For most of them, memory is organized in different levels with different access costs. Each level consists of a stack with a limited storage capacity. A variable is first stored in the first level, which has the fastest access, and, if it is not loaded quickly, is pushed to a second level with slower access, and so on. The storage cost is here constant but the loading cost of a variable depends on what happened to the cache since the last access to this variable. Given the great number of variables induced by a large integrated circuit, loading costs are important and reducing them will speed up the simulator. Two models are proposed in Section 2 to evaluate these loading costs.
As the problem is to specify an order for the vertices of a graph, it is closely related to graph layout problems, which consist in numbering the vertices of an input graph in such a way that a given objective function is optimized. The reader is referred to the recent survey by Díaz et al. [8] for a state of the art of these problems, which are also referred to as graph (linear) ordering, (linear) arrangement, numbering or labeling problems. These problems are known to be very useful to optimize the processing of large data: for example, "bandwidth (minimization) had received much attention during the fifties in order to speed up several computations on sparse matrices" [8]. However, most research was devoted to non-oriented graphs, whereas our model is based on a directed graph. For such graphs, the bandwidth, cut width and linear arrangement problems are known to be NP-complete [12, 10]. Approximation algorithms were proposed by Even et al. [9] and improved by Rao and Richa [16]. Detti and Pacciarelli proposed a branch-and-bound algorithm for a generalization of the directed linear arrangement problem [6].
We show in Section 2 that one of the two models we propose can be seen as the minimization of the Directed Sum Cut, a generalization of the Sum Cut, an objective function that has been studied in the context of non-oriented graphs [7] . To the best of our knowledge, the oriented version of this problem has never been studied. We prove the problem is NP-complete and we present polynomial algorithms for in-trees and out-trees.
Our second model originates from the Register Allocation problem [18] . In [5] , the problem of code minimization for a k-register machine was shown to be NP-complete for k = 1. Our model, called Uniform Cost Stack, is derived from this model because it relies on a similar set of operators for memory access. The main difference is that the memory is modeled by a stack structure, and the cost function is linear. We prove that the problem is NP-complete even for graphs of depth at most 1, along again with two polynomial cases (in-tree and out-tree).
The two models are introduced in Section 2. Section 3 is devoted to the theoretical study of the Directed Sum Cut and Section 4 presents analogous theoretical results (with very different proofs) for the Uniform Cost Stack model. In conclusion, some insights into the practical relevance of the two models are presented.
Models
This section mathematically introduces and discusses two combinatorial optimization problems that model the cache optimization problem for the simulation of circuits.
Let G = (V, A) be a directed acyclic graph that represents the dependence between the variables of the simulation code. The number of vertices of G is denoted by n = |V| and the number of arcs is denoted by m = |A|. For each u ∈ V, Γ+(u) (resp. Γ−(u)) is the set of successors (resp. predecessors) of u, and δ+(u) = |Γ+(u)| and δ−(u) = |Γ−(u)| denote the out- and in-degrees of u.
Since any possible code is represented by a numbering of the vertices of G, the feasible solutions of the problem are formally described by the set of bijections ϕ : V → {1, ..., n} satisfying the constraint ϕ(u) < ϕ(v) for every arc a = (u, v) ∈ A. Such functions are called graph ordering functions or graph orders. We will often use the notation ϕ^{-1}(i), for some i ∈ {1, ..., n}, to refer to the vertex whose rank is i in the order ϕ.
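The two conditions that define a graph order (bijection onto {1, ..., n} and compatibility with every arc) can be checked directly. The sketch below uses a hypothetical representation, a dictionary `phi` mapping each vertex to its rank and a list of arcs `(u, v)`; these names are illustrative only.

```python
def is_graph_order(phi, arcs):
    """Check that phi (dict vertex -> rank) is a graph order for the given arcs."""
    n = len(phi)
    # Bijection onto {1, ..., n}: every rank is used exactly once.
    is_bijection = sorted(phi.values()) == list(range(1, n + 1))
    # Precedence constraint: phi(u) < phi(v) for every arc (u, v).
    respects_arcs = all(phi[u] < phi[v] for u, v in arcs)
    return is_bijection and respects_arcs


arcs = [("a", "c"), ("b", "c"), ("c", "d")]
print(is_graph_order({"a": 1, "b": 2, "c": 3, "d": 4}, arcs))
```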
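The total cache access cost of an order ϕ is the sum of the cache access costs of all the variables, $C(\varphi, G) = \sum_{u \in V} C(\varphi, u)$. The problem is to find the order ϕ that minimizes this objective function. In the rest of this section, two models are proposed to evaluate, in two different ways, the cache access costs C(ϕ, u). In both models, C(ϕ, u) is a deterministic function that only depends on ϕ and u.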
We observe that the expression of the total cost C(ϕ, G) deliberately ignores the time spent on computing the value of the output of the logical gate once the input is read. In fact, this time is assumed to be constant. So, the total time for running the simulation code is the sum of a constant computation time and a cache access time depending on ϕ. Only this second value is minimized.
In the first model, the estimation of C(ϕ, u) is based on the number of instructions executed between two successive uses of the variable u. The second model is more complex as it keeps track of all the memory moves.
Directed Sum Cut
This first model is based on the observation that ϕ induces a numbering of the lines of the simulation code: ϕ(v) is the number of the line at which the variable v is created. For instance, in Figure 1(c), ϕ(V_4) = 5 and V_4 is created at the fifth line of the simulation code. In order to introduce the cost function, we consider the use of some variable u after its creation. The first access is made in order to compute the first successor of u w.r.t. ϕ, which is denoted by s_1(ϕ, u) or simply by s_1(u). The cost for reading u is proportional to the number of accesses to the cache since the creation of u. We consider here that it is a function f of the number of instructions in the simulation code, that is, C(ϕ, (u, s_1(u))) = f(ϕ(s_1(u)) − ϕ(u)). Notice that this assumption is not so far from reality since most of the gates of the circuits have approximately the same number of adjacent wires, so that every instruction of the simulator code has the same number of arguments.
After this computation, both variables u and s_1(u) are supposed to be equivalently cached in memory. So, when u is accessed by its second successor s_2(u), the access cost is $C(\varphi, (u, s_2(u))) = f(\varphi(s_2(u)) - \varphi(s_1(u)))$, where $\varphi(s_2(u)) - \varphi(s_1(u))$ is the number of instructions executed since the creation of s_1(u). Therefore, the total cost related to the access to variable u is

$$C(\varphi, u) = \begin{cases} \sum_{i=1}^{\delta^+(u)} f\big(\varphi(s_i(u)) - \varphi(s_{i-1}(u))\big) & \text{if } \delta^+(u) > 0, \\ 0 & \text{otherwise,} \end{cases}$$

where δ+(u) is the out-degree of u in G, s_0(u) = u and s_1(u), s_2(u), ... are the successors of u numbered w.r.t. ϕ. In order to simplify the model, we will consider that the cache access cost function f is simply the identity function. The choice of such a simple function for f is motivated by the fact that there are no hardware- or software-dependent parameters. Furthermore, this choice yields an interestingly simple expression for the total cache access cost of a variable u with δ+(u) > 0,

$$C(\varphi, u) = \varphi(s_{\delta^+(u)}(u)) - \varphi(u) = \max_{v \in \Gamma^+(u)} \varphi(v) - \varphi(u),$$

and for the objective function of the problem, which is the sum of these quantities over all the variables. The criterion can be seen as the sum of variable lifespans (difference between the line of creation and the line of last use).
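With the identity cost function, the criterion reduces to a sum of lifespans, which a single pass over the arcs can compute. The sketch below is illustrative only; the function name `dsc_cost` and the data layout (a dictionary `phi` of ranks and a list of arcs) are assumptions made for the example.

```python
def dsc_cost(phi, arcs):
    """Sum of variable lifespans: line of last use minus line of creation."""
    last_use = {}
    for u, v in arcs:
        # Keep the largest rank among the successors of u.
        last_use[u] = max(last_use.get(u, 0), phi[v])
    # Variables without successor are never read again and cost nothing.
    return sum(last - phi[u] for u, last in last_use.items())


phi = {"a": 1, "b": 2, "c": 3, "d": 4}
arcs = [("a", "c"), ("b", "c"), ("a", "d")]
# a lives from line 1 to line 4, b from line 2 to line 3.
print(dsc_cost(phi, arcs))  # prints 4
```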
The following theorem shows that the expression of this objective function can be linked to a classical criterion in graph layout problems. Namely, the vertex cut at position i, denoted by δ(i, ϕ, G), is defined as

$$\delta(i, \varphi, G) = \big|\{u \in V : \exists (u, v) \in A,\ \varphi(u) \le i < \varphi(v)\}\big|.$$

It represents the number of vertices numbered at or before i that have at least one successor v numbered after i. In terms of memory management, the interpretation of $w \notin \{u \in V : \exists (u, v) \in A,\ \varphi(u) \le i < \varphi(v)\}$ is the following: either w is not used anymore, or it has not been created yet.
Theorem 1
We have the equality

$$C(\varphi, G) = DSC(\varphi, G),$$

where DSC is the Directed Sum Cut of G ordered by ϕ and is defined as $DSC(\varphi, G) = \sum_{1 \le i \le n} \delta(i, \varphi, G)$.
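PROOF. For any couple (u, v) ∈ V × V, let us consider the indicator ξ(u, v) that is equal to 1 if and only if there is an arc (u, w) ∈ A such that ϕ(u) ≤ ϕ(v) < ϕ(w), and equal to 0 otherwise. By definition, we have

$$DSC(\varphi, G) = \sum_{v \in V} \sum_{u \in V} \xi(u, v) = \sum_{u \in V} \sum_{v \in V} \xi(u, v).$$

The inner sum is equal to the number of vertices v that are numbered in the interval $[\varphi(u), \max_{w \in \Gamma^+(u)} \varphi(w) - 1]$, that is, C(ϕ, u) (the interval is empty when u has no successor).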
This result is the counterpart of the equality between the profile and the reversed sum cut for undirected graphs [8, Observation 2.2 citing [13] ]. In the rest of this paper, this first objective function will be referred to as DSC.
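The equality of Theorem 1 can be checked exhaustively on a small instance. The sketch below assumes that the vertex cut at position i counts the vertices u having an arc (u, w) with ϕ(u) ≤ i < ϕ(w), as suggested by the indicator ξ used in the proof; the function names and the example graph are hypothetical.

```python
from itertools import permutations


def lifespan_cost(phi, arcs):
    """Total cost C(phi, G): sum over u of (rank of last successor - rank of u)."""
    last_use = {}
    for u, w in arcs:
        last_use[u] = max(last_use.get(u, 0), phi[w])
    return sum(last - phi[u] for u, last in last_use.items())


def vertex_cut(i, phi, arcs):
    """Number of vertices u with an arc (u, w) such that phi(u) <= i < phi(w)."""
    return len({u for u, w in arcs if phi[u] <= i < phi[w]})


def directed_sum_cut(phi, arcs):
    """DSC(phi, G): sum of the vertex cuts over all positions."""
    return sum(vertex_cut(i, phi, arcs) for i in range(1, len(phi) + 1))


def graph_orders(vertices, arcs):
    """Enumerate every graph order by brute force (small graphs only)."""
    for perm in permutations(vertices):
        phi = {v: rank for rank, v in enumerate(perm, start=1)}
        if all(phi[u] < phi[w] for u, w in arcs):
            yield phi


vertices = "abcde"
arcs = [("a", "c"), ("b", "c"), ("a", "d"), ("c", "e"), ("d", "e")]
for phi in graph_orders(vertices, arcs):
    assert lifespan_cost(phi, arcs) == directed_sum_cut(phi, arcs)
```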
Notations introduced for C are directly adapted to the DSC cost function :
Uniform Cost Stack
The UCS model (for Uniform Cost Stack) is intended to represent the loading costs of the variables which are stored during the execution of a program. We consider here that the cache memory is managed as a stack, and that the loading cost of a variable is proportional to its distance to the top of the stack. This model is an extension of the well-known model of Sethi presented in [18] for the register allocation problems.
The memory is seen as a stack, on which three operators are available:
• RD(α) reads the value of the input for the variable α and pushes it to the top of the stack. The duration of this operation is assumed to be a constant. Therefore, in the model, it is considered to be zero.
• LD(α) moves the variable α stored in the stack to the top. The cost of this operation is proportional to the number of variables stored between α and the top before the move.
• OP(α_1, ..., α_k) applies an operator (generically denoted by OP) to the values of the variables α_1, ..., α_k. It is supposed that α_1, ..., α_k have been previously moved (with LD-operations) to the first k levels of the stack, but these k variables can be in any order inside these first k levels of the stack. This assumption can be justified by the fact that, in a real processor, the parameters of the operators are stored in registers and the order in which the registers are initialized has no importance. The result of the computation of OP is then moved to the top of the stack, the order between the input values staying unchanged. Since the cost of an operation is supposed to be constant, we set it equal to zero.
For any variable in the stack, the distance to the top is called the depth. For example, let us consider the graph G = (V, A) pictured in Figure 2(a). The ordering function corresponds to the numbers printed inside the vertices. For this ordering function, we can derive the list of RD, LD and OP operations that are executed to evaluate the vertices of the graph. Figure 2(b) represents these operations for our example with the successive states of the stack (LD(i) means "load variable ϕ^{-1}(i)"). The total cost of an execution is then the sum of the costs of the LD-moves: each of them is associated with an arc (u, v) ∈ A. In Figure 2(a), the arcs are valued with the corresponding cost. In this way, we get a total UCS cost equal to 9.
[Figure 2(b): Stack evolution]
For general graphs, the code generation associated with a graph order ϕ is more complicated: indeed, if a vertex u ∈ V has several predecessors, we have to decide in which order they will be loaded in the stack before u to minimize the cost.
Optimal execution of an order
For a given execution order ϕ, the UCS model as it has been defined so far guarantees neither the uniqueness of the simulation code nor the uniqueness of the value of the total cost, because an order between the LD-operations, called stacking order, has to be defined for vertices which have several predecessors. Indeed, let us consider the example pictured in Figure 3. The total cost of an execution depends on the loading order of the variables ϕ^{-1}(1) and ϕ^{-1}(2) for the evaluation of ϕ^{-1}(4). If ϕ^{-1}(1) is loaded before ϕ^{-1}(2), the cost is 4, while if the order is reversed, the cost is equal to 3.
In the following, we present an optimal and simple (that is, algorithmic and polynomial) stacking policy. With this policy added to our model, the UCS cost becomes unambiguously defined.
Definition 2 (Stacking order)
The stacking order θ_u of a vertex u ∈ V is defined as a finite sequence θ_u = (θ_u(1), ..., θ_u(K_u)) of vertices (K_u is the length of the sequence). It represents the sequence of operations LD(θ_u(i)) in the simulation code before the computation of u begins. The total stacking order is the set θ = {(u, θ_u), u ∈ V}. A stacking order is compatible with respect to a given order ϕ if the resulting simulation code works, that is, the variables are correctly loaded for each operator OP. The cost of a compatible θ for the graph order ϕ is denoted by UCS_θ(ϕ, G) or, shortly, UCS_θ(ϕ). For the example pictured by Figure 3,
Let ϕ be an order. Let us consider a vertex u ∈ V that has to be evaluated. The elements of Γ − (u) are denoted by v 1 , · · · , v pu (with p u = |Γ − (u)|) and we assume they are numbered in the non-decreasing order of their stack depth. Let q 1 , · · · , q pu be the respective depths of
between v i and the top of the stack. Therefore, v i must necessarily be loaded before u can be computed. We denote by i u the maximal
Lemma 3 For a given ϕ, there is a stacking order θ that minimizes
PROOF. Clearly, K u = p u − i u implies that θ u (i) ∈ {v iu , · · · , v pu } because all these variables must be loaded. Therefore, we only show that there is an optimal stacking order such that K u = p u − i u .
Let θ be a compatible stacking order, and let u ∈ V such that K u > p u − i u . Two cases must be studied :
Let us consider the order θ'_u obtained by removing the operation LD(v_i) from θ_u, and let θ' be the resulting total stacking order.
θ' is clearly compatible with respect to ϕ. We are going to prove that UCS_θ'(ϕ) − UCS_θ(ϕ) ≤ 0. Let W be the set of vertices which are between v_i and the top of the stack before the operation LD(v_i); then the cost of LD(v_i) in θ is |W|. With the removal of LD(v_i) in θ', the cost of the next load of a variable in W is decreased, and the other loading costs for variables other than v_i are not changed.
Fig. 4. Stack when v_i is loaded again (for both stacking orders θ and θ')
Fig. 5. Comparison between stacking orders θ and θ'
If v_i is not loaded again, we clearly have UCS_θ'(ϕ) ≤ UCS_θ(ϕ). If v_i is loaded again, let us denote by q (resp. by q') the cost of the next load of v_i when using the order θ (resp. θ'). We have q' ≥ q, and Figure 4 illustrates the stack states for θ and θ' before the "next load" of v_i, where W' is the set of elements of W which have not been loaded again after the removed operation LD(v_i). Clearly, q' = q + |W'| ≤ q + |W|, so that UCS_θ'(ϕ) ≤ UCS_θ(ϕ).
• There exists v i ∈ Γ − (u), i > i u and θ u (k 1 ) = θ u (k 2 ) = v i with 1 ≤ k 1 < k 2 ≤ K u (v i is loaded more than once by θ u ). Then removing the first load does not alter the final state of the stack.
Lemma 4
The stacking order such that θ u = (v iu+1 , · · · , v pu ) for every u ∈ V is optimal.
PROOF. Let us suppose that there exists an optimal stacking order θ satisfying Lemma 3 but different from the order of the current lemma statement.
Let us now consider the minimal integer k ∈ {1, ..., p_u − 1} such that θ_u(k) = v_j, θ_u(k + 1) = v_i and j > i (the first inversion). Let θ'_u be the order defined by the inversion of θ_u(k) and θ_u(k + 1) and let θ' be the total stacking order derived from θ after changing (u, θ_u) into (u, θ'_u).
We prove that UCS_θ'(ϕ) ≤ UCS_θ(ϕ). Let us compare the successive stack states for two executions corresponding to θ and θ', as illustrated by Figure 5. Before the program arrives at the pair of operations "LD(v_j); LD(v_i)" of θ (which corresponds to "LD(v_i); LD(v_j)" for θ'), the θ-stack and the θ'-stack are identical at each step. Afterwards, v_j and v_i are swapped in the stack until an operator LD(v_i) or LD(v_j) is met. So, clearly, the cost difference between UCS_θ(ϕ) and UCS_θ'(ϕ) is due to the three LD operations we have just emphasized. Clearly, the worst case is if the third operation is LD(v_i). Let q be the depth of v_i in the θ-stack before the third LD; q is also the depth of v_j in the θ'-stack at the same time. So, we have
In the following, we will suppose that, for an order ϕ, variables are always loaded in the optimal order θ = θ (ϕ) given by Lemma 4. Therefore, the cost of a graph order can be denoted without ambiguity by UCS(ϕ) = UCS(ϕ, G) = UCS θ (ϕ, G). UCS(ϕ, (u, v)) will denote the cost of the load of u in order to compute v.
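The stack mechanics can be made concrete with a small simulator. The sketch below is an illustration under stated assumptions rather than the authors' code: it assumes that the predecessors already forming a contiguous block at the top of the stack need no load, and that the remaining predecessors are loaded from the shallowest to the deepest, which is how the policy of Lemma 4 is read here. The names `ucs_cost` and `shallowest_first`, and the example graph, are hypothetical.

```python
def ucs_cost(order, preds, shallowest_first=True):
    """Simulate the stack for a given evaluation order and return the total LD cost.

    order: list of vertices in evaluation order (a graph order).
    preds: dict vertex -> list of predecessors (missing key = input variable).
    """
    stack = []  # the top of the stack is the end of the list
    total = 0
    for u in order:
        needed = set(preds.get(u, ()))
        # Assumption: predecessors forming a block at the top need no load.
        k = 0
        while k < len(stack) and stack[-1 - k] in needed:
            k += 1
        free = set(stack[len(stack) - k:])
        to_load = [v for v in reversed(stack) if v in needed and v not in free]
        if not shallowest_first:
            to_load.reverse()  # alternative policy, for comparison only
        for v in to_load:
            depth = len(stack) - 1 - stack.index(v)  # variables above v
            total += depth
            stack.remove(v)
            stack.append(v)  # LD moves v to the top
        stack.append(u)  # RD for an input, result of OP otherwise
    return total


# Hypothetical example: c depends on a and b, d depends on a.
preds = {"c": ["a", "b"], "d": ["a"]}
order = ["b", "a", "d", "c"]
print(ucs_cost(order, preds))                          # prints 3
print(ucs_cost(order, preds, shallowest_first=False))  # prints 4
```

On this small graph, loading the deeper predecessor first costs 4 while loading the shallower one first costs 3. The list-based stack gives a quadratic-time simulation in the worst case, which is enough for experimenting with small graphs.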
Remarks
The DSC cost function can be trivially computed in O(m) time. The UCS cost function can be naively computed in O(mn) by using a linked list to represent the stack. However, by using AVL trees [1] instead, the computation time can be improved to O(m log n) [17] .
For both the DSC and UCS cost functions, the problem can be naturally decomposed when the precedence graph has several connected components. Formally, for C ≡ DSC or C ≡ UCS, if the graph G has k connected components
The DSC model
This section is dedicated to the DSC model. We first prove that the problem is NP-complete, even for graphs with depth equal to 2. Then, we prove that the problem is polynomial for in-trees and out-trees.
Complexity
We did not find in the literature any proof of the complexity of the Directed Sum Cut problem. We prove in this section that the problem is, unsurprisingly, NP-complete. The problem is NP-complete even for digraphs of depth 2 (here, the depth is the number of arcs of a longest path). We consider the following decision variant of the problem of the minimization of the Directed Sum Cut.
Lemma 5 There exists a polynomial transformation from MinLA to MinMaxEdge.
PROOF. Let us consider an arbitrary graph G = (V, E) given as an input of MinLA. Let δ G (u) denote the degree of the vertex u ∈ V in G and let 
G max (see Figure 6 ). Since max(i, j)
In the right side of the final equality, the first member of the sum is a constant while the second one is the linear arrangement of G. So a solution f for MinMaxEdge with a cost δ PROOF. Let G = (V, E) be an arbitrary multi-graph (input of MinMaxEdge). We build the graph G = (V , E ) with V = E ∪ { } ∪ V (note that E is a multiset and may contain duplicate values corresponding to parallel arcs, V is however a set: the multiple occurrences of an element of E are differenti-ated from each other in V ). E is the union of the three following sets (see Figure 7 ) : -{(e, )|∀e ∈ E} -{( , u)|∀u ∈ V } -{({u, u}, u)|∀{u, u} ∈ E} ∪ {({u, v}, u), ({u, v}, v)|∀{u, v} ∈ E|u = v} Clearly, for any order ϕ of G , ϕ(E) = {1, · · · , |E|}, ϕ( ) = |E| + 1 and ϕ(V ) = {|E| + 2, · · · , |E| + |V | + 1}. Since, in G , there is no outgoing arc from the nodes in V , the cache function for any order ϕ is : , v}) is the sum of the integers 1, · · · , |E|, and max u∈V ϕ(u) is the last number of the order, that is |E|+|V |+1, we finally have:
So, when we have a directed order ϕ for G', we can build a bijective function f for G by taking, for any u ∈ V, f(u) = ϕ(u) − ϕ( ). The above equality becomes DSC(ϕ, G') = Σ_{{u,v}∈E} max(f(u), f(v)) + |V|. Therefore if the directed sum cut of ϕ is less than |V| + K, the cost of the bijective function f is less than K. Conversely, if we have a bijective function f for G with cost less than K, we can easily build an order with a directed sum cut less than |V| + K by taking, for any u ∈ V, ϕ(u) = f(u) + |E| + 1, ϕ( ) = |E| + 1 and by arbitrarily ordering the elements of E. Now, since MinLA [11] is NP-complete, we deduce the following theorem:
Theorem 7 MinDSC is NP-complete for digraphs of depth 2.
The approximation techniques of Rao and Richa [16], based on the Divide-and-Conquer approximation method presented by Even et al. [9], can be directly adapted to give an O(log n)-approximation algorithm for MinDSC.
Polynomial cases

In-tree
Here the precedence graph G = (V, A) is an in-tree, i.e. each node v ∈ V has at most one outgoing arc. With this property, the expression of the directed sum cut is greatly simplified:

$$DSC(\varphi, G) = \sum_{(u, v) \in A} \big(\varphi(v) - \varphi(u)\big).$$
This expression shows that when the precedence graph is an in-tree, the directed sum cut is equal to the directed linear arrangement. The latter problem has been shown to be polynomial [2] .
It can also be observed that the problem is equivalent to the scheduling problem 1 | in-tree, p_i = 1 | Σ w_i C_i, in which each task i corresponds to a node v ∈ V, the precedence graph is equal to G, and the task weights are w_i = δ−(v). This problem is of course polynomial (see for example [4]).
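The equivalence with the scheduling problem can be checked by regrouping the sum over the arcs vertex by vertex. The short derivation below is a sketch under the assumption that G is a connected in-tree with root r (so that every node other than r has exactly one outgoing arc and ϕ(r) = n), the completion time of the task associated with v being identified with ϕ(v).

```latex
\begin{align*}
\mathrm{DSC}(\varphi, G)
  &= \sum_{(u,v) \in A} \bigl(\varphi(v) - \varphi(u)\bigr) \\
  &= \sum_{v \in V} \delta^-(v)\,\varphi(v) \;-\; \sum_{u \in V \setminus \{r\}} \varphi(u) \\
  &= \sum_{v \in V} \delta^-(v)\,\varphi(v) \;-\; \frac{n(n-1)}{2}.
\end{align*}
```

The last term does not depend on ϕ, so minimizing the directed sum cut amounts to minimizing the weighted sum of completion times with weights δ−(v).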
Out-tree
The precedence graph G = (V, A) is now an out-tree, i.e., each node v ∈ V has at most one incoming arc. From Section 2.3, we assume w.l.o.g. that G is connected so that m = n − 1. We present an algorithm that computes the optimal ordering of the nodes in linear time. This algorithm is based on the following lemma.
Lemma 8 There exists an optimal order in which all the nodes in some subtree T of the root node of G are ordered after all the nodes that are in G\T .
PROOF. Let us consider an order ϕ that does not verify this property. By way of the transformation depicted in Figure 8, we are going to construct a new order ϕ' such that DSC(ϕ', G) ≤ DSC(ϕ, G). The root node r clearly satisfies ϕ(r) = 1; let T denote the subtree of r that contains the "last-ordered" node ϕ^{-1}(n) and let r_T be the root of T. From our initial assumption, we have ϕ(r_T) < n − |T| + 1, which means that at least one node of G\T is ordered in between the nodes of T. The new order ϕ' (see Figure 8) is built such that (1) the relative order between the nodes in T is not modified, (2) the relative order between the nodes in G\T is not modified, (3) ϕ'(T) = [n − |T| + 1, n] and, consequently, ϕ'(G\T) = [1, n − |T|].
These three rules clearly define the construction of a unique order ϕ'. This order is compatible with the topological order because (r, r_T) is the only arc between T and the rest of G and, after the transformation, we still have ϕ'(r) = 1 < ϕ'(r_T).
In order to show that DSC(ϕ', G) ≤ DSC(ϕ, G), we consider an arc (u, v) such that u ≠ r (the case u = r is studied afterwards). The quantity

$$\Delta(u, v) = \big(\varphi(v) - \varphi(u)\big) - \big(\varphi'(v) - \varphi'(u)\big)$$

is the decrease of the cost of (u, v) by re-ordering ϕ in ϕ'. By construction, ∆(u, v) is nonnegative. Indeed, if (u, v) is in the subtree T, ∆(u, v) is equal to the number of nodes w ∉ T such that ϕ(u) < ϕ(w) < ϕ(v). Symmetrically, if (u, v) is not in the subtree T, ∆(u, v) is equal to the number of nodes w ∈ T such that ϕ(u) < ϕ(w) < ϕ(v). Therefore, the access cost of each node u ≠ r has not increased, that is, with the formulation of the cache cost given in (1), C(ϕ', u) ≤ C(ϕ, u).
However, the access cost C(ϕ, r) of the root node r generally increases. Since C(ϕ', r) = ϕ'(r_T) − 1 and C(ϕ, r) ≥ ϕ(r_T) − 1, the increase satisfies

$$C(\varphi', r) - C(\varphi, r) \le \varphi'(r_T) - \varphi(r_T).$$

Let ∆(r_T) denote the value of the right side of this inequality. We observe that ∆(r_T) is equal to the number of nodes w ∉ T such that ϕ(w) > ϕ(r_T).
We complete the proof that DSC(ϕ', G) ≤ DSC(ϕ, G) by showing that the decrease of the total access cost of all the nodes u ≠ r is at least ∆(r_T). ϕ(T) is a subset of {1, ..., n}, so that it can be seen as the union of integer intervals I_1 < I_2 < ... < I_k. Clearly, ϕ is a bijection between the nodes of T and the union of I_1, ..., I_k. Let us consider the arc (u_1, v_1) of T such that ϕ(u_1) ∈ I_1 and ϕ(v_1) is maximum. Since T is connected, ϕ(v_1) is in some interval I_{k_1} with k_1 > 1. By construction, v_1 is the last successor of u_1 according to the order ϕ, so that the access cost of u_1 is ϕ(v_1) − ϕ(u_1). If k_1 < k, we can iterate the construction: let (u_2, v_2) be the arc such that ϕ(u_2) ∈ I_{k_1} and ϕ(v_2) is maximum. Let I_{k_2} be the interval that contains ϕ(v_2). At the end, we construct a sequence of arcs (u_1, v_1), ..., (u_l, v_l) and intervals I_{k_1}, ..., I_{k_l} with k_l = k.
For each of these arcs, ∆(u_i, v_i) is equal to the number of nodes w ∉ T such that ϕ(u_i) < ϕ(w) < ϕ(v_i). Therefore, since v_i and u_{i+1} are in the same interval I_{k_i}, the sum $\sum_{i=1}^{l} \Delta(u_i, v_i)$ is at least ∆(r_T).
Therefore DSC(ϕ , G) ≤ DSC(ϕ, G), which completes the proof.
Let us now determine how to select the terminal subtree. For each direct descendant u of r, let T(u) denote the subtree rooted at u. If T(u*) denotes the final subtree, we have C(ϕ, r) = n − |T(u*)|, while the costs inside the subtrees do not depend on which subtree is ordered last.
Therefore, in order to minimize the Directed Sum Cut, u* must be selected such that T(u*) is a largest subtree of r.
So, the decomposition shows that the ordering of an out-tree G is given by calling the following recursive algorithm with the root of G as first parameter and 1 as second parameter.
proc orderouttree(r, i):
    ϕ(r) ← i
    if r is not a leaf then
        let u* be one descendant of r such that |T(u*)| is maximum
        j ← i + 1
        for each direct descendant u ≠ u* of r do
            orderouttree(u, j)
            j ← j + |T(u)|
        orderouttree(u*, j)
We finally prove the complexity of this algorithm.
Theorem 9
The minimal directed sum cut of an out-tree can be computed in O(n) time.
PROOF. In a preprocessing phase, all the sizes |T (u)| of the subtrees for all the nodes u of G can be computed in O(n) time. The recursive procedure is called once for each node and selecting u at a given node r takes O(δ + (r)) time, so the total time for the algorithm is O(n).
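The linear-time procedure can be sketched in Python as follows. This is one possible reading of the recursive algorithm, written iteratively to avoid deep recursion: at each node, the subtrees other than a largest one are laid out first in consecutive blocks and a largest subtree is laid out last. The names `order_out_tree`, `dsc_cost` and `brute_force_min`, as well as the `children` dictionary, are hypothetical; the brute-force search is only there to compare the result on a tiny tree.

```python
from itertools import permutations


def dsc_cost(phi, children):
    """Directed sum cut of an out-tree: last child rank minus own rank, summed."""
    return sum(max(phi[c] for c in kids) - phi[u]
               for u, kids in children.items() if kids)


def order_out_tree(children, root):
    """children: dict node -> list of children. Returns phi: node -> rank."""
    # Preprocessing: subtree sizes, children being handled before parents.
    preorder, todo = [], [root]
    while todo:
        r = todo.pop()
        preorder.append(r)
        todo.extend(children.get(r, ()))
    size = {}
    for r in reversed(preorder):
        size[r] = 1 + sum(size[c] for c in children.get(r, ()))
    # Each work item is (node, first rank of the block reserved for its subtree).
    phi, work = {}, [(root, 1)]
    while work:
        r, i = work.pop()
        phi[r] = i
        kids = children.get(r, ())
        if not kids:
            continue
        largest = max(kids, key=size.__getitem__)
        j = i + 1
        for u in kids:
            if u != largest:
                work.append((u, j))
                j += size[u]
        work.append((largest, j))  # a largest subtree is ordered last
    return phi


def brute_force_min(children, root):
    """Minimum DSC over all graph orders (exponential, tiny trees only)."""
    nodes = {root} | {c for kids in children.values() for c in kids}
    best = None
    for perm in permutations(sorted(nodes)):
        phi = {v: k for k, v in enumerate(perm, start=1)}
        if all(phi[u] < phi[c] for u, kids in children.items() for c in kids):
            cost = dsc_cost(phi, children)
            best = cost if best is None else min(best, cost)
    return best


children = {0: [1, 2], 1: [3, 4], 2: [5]}
phi = order_out_tree(children, 0)
print(dsc_cost(phi, children), brute_force_min(children, 0))
```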
Let us now consider the variant of the DSC cost function where the cache access cost function f is not the identity, that is, $C(\varphi, u) = \sum_{i=1}^{\delta^+(u)} f\big(\varphi(s_i(u)) - \varphi(s_{i-1}(u))\big)$. If we assume that f is concave and nondecreasing, we can prove, by using the inequality f(x + y) ≤ f(x) + f(y) for any x, y ≥ 0, that the lemma and the algorithm to solve the problem both hold. If the function f is convex, the lemma is not true anymore.
The UCS model
Complexity of UCS for a bipartite graph
We prove here that the decision version of UCS is NP-complete even for bipartite graphs. The problem is defined as follows:
Minimum Bipartite Uniform Cost Stack (BipUCS)
INSTANCE: G = (V, A), a bipartite directed acyclic graph; an integer K.
QUESTION: Is it possible to find a bijective function ϕ : V → {1, ..., |V|} such that UCS(ϕ, G) ≤ K?
We prove that BipUCS is NP-complete using a reduction from Minimum Linear Arrangement (MinLA).
Theorem 10 There exists a polynomial transformation from MinLA to BipUCS.
PROOF. Let us consider an instance Π of MinLA given by a graph H = (W, E) with W = {1, ..., n} and an integer B. Let m be equal to |E|. We build an associated instance Π' of BipUCS defined by a graph G = (V, A) and an integer K with V = X ∪ Y defined as follows (see Figure 9):
• The set Y corresponds to W . We denote by y i the element of Y that is associated to the vertex i of W .
• The set X is the union of the set X(E), which corresponds to E, and the sets Q(y_1), ..., Q(y_n). The element of X(E) corresponding to the edge e = {i, j} of E is denoted by x(e). Each set Q(y_i) has n^6 elements and the sets Q(y_i) are pairwise disjoint.
• Each vertex in Q(y_i) has one successor that is y_i. Each vertex x({i, j}) ∈ X(E) has exactly two successors y_i and y_j.
• K is set to be equal to (n^6 + m + 1)B + m^2.
This transformation is clearly polynomial and the graph G is bipartite.
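The construction can be sketched as a small builder function. The tuple labels for the vertices, the function name `build_bipucs_instance` and the `pad` parameter are hypothetical; `pad` stands for the size of each set Q(y_i) and defaults to n^6 as in the transformation, smaller values being useful only to inspect the structure of the graph (the reduction argument is not claimed for them).

```python
def build_bipucs_instance(n, edges, B, pad=None):
    """Build the arcs of G and the bound K from a MinLA instance (H, B).

    n: number of vertices of H, labeled 1..n.
    edges: list of pairs (i, j) of vertices of H.
    B: bound of the MinLA instance.
    pad: size of each set Q(y_i); n**6 when left to None.
    """
    if pad is None:
        pad = n ** 6
    m = len(edges)
    arcs = []
    # Each vertex of Q(y_i) has y_i as its unique successor.
    for i in range(1, n + 1):
        for t in range(pad):
            arcs.append((("q", i, t), ("y", i)))
    # Each vertex x({i, j}) has exactly two successors, y_i and y_j.
    for i, j in edges:
        x = ("x", i, j)
        arcs.append((x, ("y", i)))
        arcs.append((x, ("y", j)))
    K = (pad + m + 1) * B + m * m
    return arcs, K


arcs, K = build_bipucs_instance(3, [(1, 2), (2, 3)], B=2, pad=4)
print(len(arcs), K)  # 3 * 4 + 2 * 2 = 16 arcs, K = (4 + 2 + 1) * 2 + 4 = 18
```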
[Figure 9: the graph G = (V, A) of the instance Π', with vertices x(e_1), ..., x(e_5)]
Let us suppose that the answer to Π is "yes" and let f be a solution. In order to simplify the notation, we assume without loss of generality that f(i) = i for each i ∈ W. An order ϕ of the corresponding instance Π' is computed by numbering the vertices y_i of Y according to the order f of W. In other words, ϕ(y_1) < ϕ(y_2) < ... < ϕ(y_n). Assuming that ϕ(y_0) = 0, we number the elements of X as follows:
• from ϕ(y_{i−1}) + 1 to ϕ(y_{i−1}) + n^6, the n^6 elements of Q(y_i),
• from ϕ(y_{i−1}) + n^6 + 1 to ϕ(y_i) − 1, the elements x({i, j}) of X(E) such that i < j.
Clearly, the vertices x({i, j}) with j < i have been numbered before ϕ(y_j). Now, we prove that UCS(ϕ, G) ≤ K. For every (x, y) ∈ A, UCS(ϕ, (x, y)) denotes the load cost of x for the evaluation of y. We obtain:

$$UCS(\varphi, G) = \sum_{i=1}^{n} \sum_{x \in Q(y_i)} UCS(\varphi, (x, y_i)) + \sum_{e = \{i, j\} \in E} \Big( UCS(\varphi, (x(e), y_i)) + UCS(\varphi, (x(e), y_j)) \Big).$$
In this sum, UCS(ϕ, (x, y)) > 0 only for some y = y_i and x = x({i, j}) with j < i. Let us consider such a vertex x. Since x is a predecessor of y_j, it was first pushed in the stack with an RD-operation for the computation of y_j, which was executed before the computation of y_i (j < i implies that ϕ(y_j) < ϕ(y_i)). We can set UCS(ϕ, (x, y_i)) = ∆(x, y_j) + ∆(y_j), where ∆(x, y_j) (resp. ∆(y_j)) is the number of vertices stacked between x and y_j with y_j included (resp. between y_j and the top of the stack) just before the execution of y_i (see Figure 10).
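Every vertex $y_j \in Y$ has at most m predecessors x in X(E), so $\Delta(x, y_j) \le m$. Moreover, every vertex $y_k \in Y$ has at most $n^6 + m$ predecessors. Since tasks from Y are stacked according to ϕ, we get $\Delta(y_j) \le (f(i) - f(j))(n^6 + m + 1)$. Summing over the m edges of E, and using the fact that f is a solution of Π:
$$\mathrm{UCS}(\varphi, G) = \sum_{\{i,j\} \in E,\ j < i} \big( \Delta(x(\{i,j\}), y_j) + \Delta(y_j) \big) \le \sum_{\{i,j\} \in E,\ j < i} \big( m + (f(i) - f(j))(n^6 + m + 1) \big) \le m^2 + (n^6 + m + 1)B.$$
So, we get $\mathrm{UCS}(\varphi, G) \le (n^6 + m + 1)B + m^2 = K$. ϕ is then a solution for the instance Π′ of BipUCS.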
Conversely, let us suppose that ϕ is a solution to the instance Π′ of BipUCS, so that UCS(ϕ, G) ≤ K. We can assume w.l.o.g. that the vertices of W are numbered such that $\varphi(y_1) < \varphi(y_2) < \dots < \varphi(y_n)$. Then, we define the order function f(i) = i for every i ∈ W and we prove that this function f ≡ Id is a solution for the instance Π of MinLA.
Let us consider the state of the stack before some LD operation just before the creation of the variable $y_j$. Since none of the $y_k$, k ∈ {1, ..., j − 1}, has been loaded after its creation, the elements of Y appear in the stack in the order of their indices.
First, observe that for each k ∈ {2, ..., n}, the elements of $Q(y_k)$ are stacked just after $y_{k-1}$. Indeed, the elements of $Q(y_k)$ have not been reloaded, so they are stacked between $y_{k-1}$ and $y_k$. Moreover, let us suppose that there exists a vertex x(e) ∈ X(E) such that
$$\varphi(y_{k-1}) < \varphi(x(e)) < \varphi(z^*),$$
where $z^*$ is the element of $Q(y_k)$ whose value $\varphi(z^*)$ is maximum. Since $z^*$ will not be loaded any more, the UCS cost will decrease by swapping $\varphi(x(e))$ and $\varphi(z^*)$.
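Now, let e = {i, j} ∈ E with i < j. At the computation of $y_j$, x(e) will be reloaded: its depth is greater than the depth of $y_i$, which is greater than $(n^6 + 1)(j - i)$ (as a consequence of the above observation). So, $\mathrm{UCS}(\varphi, (x(e), y_j)) \ge (f(j) - f(i))(n^6 + 1)$ (we use f ≡ Id). Now,
$$\mathrm{UCS}(\varphi, G) = \sum_{e = \{i,j\} \in E,\ i < j} \mathrm{UCS}(\varphi, (x(e), y_j)) \ge (n^6 + 1) \sum_{e = \{i,j\} \in E,\ i < j} (f(j) - f(i)).$$
Since $\mathrm{UCS}(\varphi, G) \le K = (n^6 + m + 1)B + m^2$, the sum on the right is at most $B + (mB + m^2)/(n^6 + 1)$. The fraction is smaller than 1 as soon as $mB + m^2 \le n^6$, which holds for n ≥ 2 when $m \le n^2$ and $B \le n^3$ (an instance of MinLA with a larger bound is trivially feasible). The sum being an integer, it is at most B, and f is then a solution to Π.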
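The role of the factor $n^6$ can be checked numerically. The sketch below is a sanity check, not part of the original argument: it assumes K = $(n^6 + m + 1)B + m^2$, a simple graph ($m \le n(n-1)/2$) and a non-trivial bound ($B < n^3$), and verifies that dividing K by $n^6 + 1$ and rounding down gives back exactly B. The function names are illustrative.

```python
def recovered_bound(n, m, B):
    """Largest integer S such that (n**6 + 1) * S <= K,
    where K = (n**6 + m + 1) * B + m**2."""
    K = (n ** 6 + m + 1) * B + m * m
    return K // (n ** 6 + 1)


def check_scaling(max_n=5):
    """Return True if the recovered bound equals B for every tested triple.

    Tested range (an assumption of this sketch): 2 <= n <= max_n,
    0 <= m <= n(n-1)/2 and 0 <= B < n**3.
    """
    for n in range(2, max_n + 1):
        for m in range(n * (n - 1) // 2 + 1):
            for B in range(n ** 3):
                if recovered_bound(n, m, B) != B:
                    return False
    return True
```

In other words, any order ϕ with UCS(ϕ, G) ≤ K forces the integer sum of the values f(j) − f(i) to be at most B on these instances, because the slack $mB + m^2$ never reaches $n^6 + 1$.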
Corollary 11 UCS is NP-hard for a bipartite directed acyclic graph.
Polynomial cases
In-tree
We suppose here that G = (V, A) is an in-tree. For every u ∈ V, $\Gamma^{*-}(u)$ is the set of the ancestors of u and s(u) is the unique successor of u in G (for u ≠ r). We also denote by G(u) the subtree of G rooted at u and by r the root of G.
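Lemma 12 For any execution order ϕ, another order ϕ′ can be built with UCS(ϕ′, G) ≤ UCS(ϕ, G) and such that all the ancestors of any vertex u ∈ V are ordered by ϕ′ just before u:
$$\forall u \in V,\ \forall v \in \Gamma^{*-}(u): \quad \varphi'(u) - |\Gamma^{*-}(u)| \le \varphi'(v) < \varphi'(u).$$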
PROOF. Let us suppose that ϕ is an optimal order which does not fulfill the condition expressed by the lemma. Let u be the first vertex of V (for the order ϕ) which does not satisfy this condition. Let k be the last node (for ϕ) such that ϕ(k) < ϕ(u) and $k \notin \Gamma^{*-}(u)$.
Fig. 11. Transforming an order for the UCS cost of an in-tree
By the minimality of ϕ(u), the predecessors of k are computed just before k. For the sake of clarity, the following sets are defined (see Figure 11):
$$\Omega_1 = \{v \in V : \varphi(v) < \varphi(k) - |\Gamma^{*-}(k)|\}, \qquad \Omega_2 = \{v \in V : \varphi(k) < \varphi(v) < \varphi(u)\}, \qquad \Omega_3 = \{v \in V : \varphi(u) < \varphi(v)\}.$$
Notice that s(u) and s(k) both belong to $\Omega_3$. A new order ϕ′ is derived from ϕ by moving k and $\Gamma^{*-}(k)$ just after u (Figure 11).
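The transformation of the order can be illustrated as follows. This is a minimal sketch under stated assumptions: the in-tree is given as a hypothetical dictionary `pred` mapping each vertex to the list of its direct predecessors, the order is a Python list, and the ancestors of k are assumed to occupy the positions just before k, as in the proof. The function names are illustrative.

```python
def ancestors(pred, u):
    """Set of all ancestors of u, following predecessor links transitively."""
    seen, todo = set(), list(pred.get(u, []))
    while todo:
        v = todo.pop()
        if v not in seen:
            seen.add(v)
            todo.extend(pred.get(v, []))
    return seen


def move_block_after(order, k, u, pred):
    """Move k and its ancestors, as one block, to just after u.

    The order is read as Omega_1, block(k), Omega_2, u, Omega_3 and is
    rewritten as Omega_1, Omega_2, u, block(k), Omega_3.
    """
    block = ancestors(pred, k) | {k}
    pos_k = order.index(k)
    pos_u = order.index(u)
    start = pos_k - len(block) + 1
    if pos_k >= pos_u or start < 0 or set(order[start:pos_k + 1]) != block:
        raise ValueError("k and its ancestors must form a block before u")
    omega1 = order[:start]
    moved = order[start:pos_k + 1]
    omega2 = order[pos_k + 1:pos_u]
    omega3 = order[pos_u + 1:]
    return omega1 + omega2 + [u] + moved + omega3
```

For instance, with `pred = {"r": ["u", "k"], "u": ["a", "c"], "k": ["b"]}` and the order `["a", "b", "k", "c", "u", "r"]`, the ancestors of u are not contiguous; moving the block `["b", "k"]` after u yields `["a", "c", "u", "b", "k", "r"]`, which is still a topological order because s(k) = r lies in $\Omega_3$.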
For every v ∈ V − {r}, we set $\Delta(v) = \mathrm{UCS}(\varphi', (v, s(v))) - \mathrm{UCS}(\varphi, (v, s(v)))$. We prove that $\sum_{v \in V - \{r\}} \Delta(v) \le 0$. The value $\Delta(v)$ depends on the following 6 cases:
At each step u ∈ V, we have to sort the values |G(v)|, $v \in \Gamma^{-}(u)$. The complexity of the algorithm is then O(n log n).
Out-tree
We suppose here that G = (V, A) is a connected out-tree (|A| = m = n − 1). We prove that the optimal order computed for DSC in Section 3.2.2, denoted by $\varphi^*$, is also optimal for UCS. We first prove an inequality linking the UCS and DSC cost functions.
Lemma 14
In an out-tree, UCS(ϕ) ≥ DSC(ϕ) − m for any graph order ϕ.
PROOF. Let us consider the unique LD operation associated with the arc (u, v) of the out-tree. We prove that UCS(ϕ, (u, v)) ≥ DSC(ϕ, (u, v)) − 1. If u is already on the top of the stack, it is the result of the previous RD or OP operation. So we have ϕ(v) = ϕ(u) + 1 and v is the first successor of u. Then UCS(ϕ, (u, v)) = 0 = DSC(ϕ, (u, v)) − 1. If u is not on the top of the stack, then v is the i-th successor $s_i(u)$ of u for some $1 \le i \le \delta^+(u)$. The last time that u was on the top of the stack was when u was created (if i = 1) or just before $s_{i-1}(u)$ was computed (if i > 1). In any case, at least $\varphi(s_i(u)) - \varphi(s_{i-1}(u)) - 1$ variables were pushed by OP operations afterwards. So, $\mathrm{UCS}(\varphi, (u, v)) \ge \varphi(s_i(u)) - \varphi(s_{i-1}(u)) - 1 = \mathrm{DSC}(\varphi, (u, v)) - 1$. By summing these inequalities over the m arcs, we eventually obtain UCS(ϕ) ≥ DSC(ϕ) − m.
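The lower bound of Lemma 14 is easy to evaluate for a given order. The sketch below reads the per-arc DSC cost off the proof, namely $\varphi(s_i(u)) - \varphi(s_{i-1}(u))$ for the i-th successor of u, with the convention $s_0(u) = u$; this convention and the data layout (a dictionary `children` of successor lists and a dictionary `phi` of positions) are assumptions of this example, since the definition of DSC lies outside this section.

```python
def dsc_out_tree(children, phi):
    """Per-arc DSC cost on an out-tree.

    children : dict mapping a vertex to the list of its successors
    phi      : dict mapping every vertex to its position in the order
    The successors of u are ranked by phi; the cost of the arc to the
    i-th successor is its distance to the (i-1)-th one (u itself if i = 1).
    """
    costs = {}
    for u, succ in children.items():
        previous = phi[u]
        for v in sorted(succ, key=lambda w: phi[w]):
            costs[(u, v)] = phi[v] - previous
            previous = phi[v]
    return costs


def ucs_lower_bound(children, phi):
    """Lower bound DSC(phi) - m on UCS(phi) given by Lemma 14."""
    costs = dsc_out_tree(children, phi)
    return sum(costs.values()) - len(costs)
```

With `children = {"r": ["a", "b"], "a": ["c"]}` and the order r, a, c, b (positions 1 to 4), the arc costs are 1 for (r, a), 2 for (r, b) and 1 for (a, c), so DSC(ϕ) = 4 and the bound on UCS(ϕ) is 4 − 3 = 1.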
Theorem 15
For an out-tree G, the optimal solution $\varphi^*$ of DSC is also optimal for UCS. So, the minimum UCS can be computed in O(n) time.
Conclusion
This paper has proposed two combinatorial optimization models for the problem of minimizing the memory access times of integrated circuit simulators. The problems are NP-hard even when the depth of the graph is bounded. However, when the graph describing the circuit is an in-tree or an out-tree, the problems are polynomial for both our criteria.
The hypotheses on the evaluation functions can be criticized because cache policies may be randomized and the real access times depend on numerous other parameters, such as the cache size, the operating system (and its settings), the memory state when the simulation code is run, the programs that are run concurrently, and many others. However, experimental tests have shown that the two criteria are correlated on graphs derived from existing integrated circuits. Moreover, thorough tests on real integrated circuit simulators have shown that a real simulation speed-up can be obtained when the simulation code is based on a graph order obtained with heuristics taking both criteria into account [3].
