An area-universal VLSI circuit can be programmed to emulate every circuit of a given area, but at the cost of lower area-time performance. In particular, if a circuit with area-time 
Introduction
Area-universal circuits are VLSI designs that can be programmed to emulate all the circuits of a given area A with bounded loss in area-time performance. The study of area universality can give valuable insights into general purpose multiprocessing, as well as into the exploitation of field programmable gate arrays, and other forms of reconfigurable architectures for VLSI.
Two parameters characterize the quality of a universal circuit. The blowup $\alpha = A_u/A$, where $A_u$ is the area of the universal circuit (also called the host circuit) and $A$ is the area of the emulated circuit (also called the guest circuit), measures the hardware cost of a universal design. The slowdown $\sigma = T_u/T$, where $T_u$ is the time taken by the host circuit to emulate $T$ steps of the guest circuit, measures the speed penalty incurred. The tradeoffs achievable between $\alpha$ and $\sigma$ are of great interest.
Area universality has received considerable attention in the literature. The pioneering work of Leiserson [1] introduced the first efficient universal network, the concentrator fat-tree, and established its effectiveness for off-line routing; this was later extended to randomized on-line routing in [2]. Bay and Bilardi [3] proposed the pruned butterfly fat-tree and the sorting fat-tree for efficient deterministic on-line routing. Greenberg [4] defined the pyramid fat-tree and established its universality properties under various delay models for signal propagation. Blowup and slowdown bounds for on-line routing were given by Leighton, Maggs, Ranade, and Rao [5] and by Bay and Bilardi [6], in the word and in the bit models of VLSI computation, respectively. The papers by Bilardi and Bay [7] and by Bilardi, Chaudhuri, Dubhashi, and Mehlhorn [8] explored the slowdown-blowup tradeoff from the perspective of lower bounds. Fat-tree-like networks have been adopted in some commercial multiprocessors (e.g., see [9][10][11]).
While the results of [5] and [6] show that constant blowup is achievable with polylogarithmic slowdown, a question that has remained open is whether constant slowdown is achievable and, if so, at what price in area. Initial progress in this direction was made by Kaklamanis, Krizanc, and Rao [12], who showed that a butterfly of area $A^{1+\epsilon}$ can simulate any area-$A$ network, further constrained to have O(
with slowdown O(log log A).
In this paper, we exhibit an area-universal circuit $U_A$ with blowup $O(A^{\epsilon})$, for any chosen positive $\epsilon \ge 4\log\log A/\log A$, and with slowdown $O(1/\epsilon)$. When $\epsilon$ is chosen to be a fixed constant, our construction yields constant slowdown.
An overview of $U_A$
The high-level structure of our circuit $U_A$ is that of a binary fat-tree of A leaves, whose nodes, henceforth called fat-nodes, have size decreasing with their distance from the root. In contrast to most previously proposed fat-trees, where the leaves are in charge of computation and the internal nodes support communication, each fat-node of $U_A$ performs both computation and communication functions. Figs. 1 and 2 illustrate the structure and a coarse-level layout of the overall fat-tree, and of one fat-node, respectively.
Specifically, a fat-node at level i in the fat-tree is equipped with a number $\lambda_i = O(\sqrt{A/2^i})$ of so-called emulation trees of height h/4, where $h = (\epsilon/2)\log A$. For emulation purposes, each vertex u of the guest circuit G is handled by a suitably chosen emulation tree $T_u$, using the technique proposed by Meyer auf der Heide in [13], which can be briefly described as follows. Each node of $T_u$ emulates a single vertex of G: the root emulates u and, if an internal node emulates some vertex v, its children emulate the neighbors of v in G (observe that the same vertex may be emulated by many nodes of the tree). The tree $T_u$ has the following capability: if each of its nodes is initialized with the state of the corresponding guest vertex at (guest) step t, then $T_u$ can produce at its root the correct state of vertex u at (guest) time t + h/4 in O(h) steps. In this process, for j = 1, ..., h/4, during the simulation of the j-th guest step only the nodes in the top h/4 − j + 1 levels of $T_u$ can update their simulated state. After simulating h/4 steps, a phase of global communication is needed to send the current state information from the roots of the emulation trees, where it has been computed, to the nodes of the emulation trees where it is needed.
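The emulation-tree technique described above can be illustrated with a small sketch. Everything in it is an assumption made for illustration only: the guest is an 8-vertex ring with boolean states, the update rule is a parity function, and the names `emulate_with_tree`, `step_direct`, and `xor_rule` are hypothetical. The sketch builds the tree rooted at u (children emulate the neighbors of their parent's vertex), lets only the top `height - j + 1` levels update during the j-th step, and compares the value at the root with a direct simulation of the guest.

```python
from typing import Callable, Dict, List

Graph = Dict[int, List[int]]
Rule = Callable[[int, List[int]], int]


def step_direct(graph: Graph, state: Dict[int, int], rule: Rule) -> Dict[int, int]:
    """One synchronous guest step: each vertex reads its own and its neighbors' states."""
    return {v: rule(state[v], [state[w] for w in graph[v]]) for v in graph}


def emulate_with_tree(graph: Graph, state: Dict[int, int], rule: Rule,
                      u: int, height: int) -> int:
    """Return the state of u after `height` guest steps, using only the tree rooted at u."""
    # labels[d][k] is the guest vertex emulated by the k-th tree node at depth d.
    labels: List[List[int]] = [[u]]
    # children[d][k] lists the positions, at depth d + 1, of the children of that node.
    children: List[List[range]] = []
    for d in range(height):
        next_level: List[int] = []
        kids: List[range] = []
        for v in labels[d]:
            first = len(next_level)
            next_level.extend(graph[v])  # one child per neighbor of v
            kids.append(range(first, len(next_level)))
        labels.append(next_level)
        children.append(kids)

    # Every tree node starts with the state of its guest vertex at the same guest time.
    values = [[state[v] for v in level] for level in labels]
    for j in range(1, height + 1):
        # During step j only depths 0 .. height - j can still update correctly.
        # Going top-down, each node reads children that hold the previous time step.
        for d in range(height - j + 1):
            values[d] = [
                rule(values[d][k], [values[d + 1][c] for c in children[d][k]])
                for k in range(len(labels[d]))
            ]
    return values[0][0]


def xor_rule(own: int, neighbors: List[int]) -> int:
    """Hypothetical guest update rule: parity of own state and neighbor states."""
    return (own + sum(neighbors)) % 2


ring: Graph = {v: [(v - 1) % 8, (v + 1) % 8] for v in range(8)}
initial = {v: 1 if v in (0, 3, 4) else 0 for v in range(8)}

expected = dict(initial)
for _ in range(3):
    expected = step_direct(ring, expected, xor_rule)

emulated = {u: emulate_with_tree(ring, initial, xor_rule, u, 3) for u in ring}
```

In this toy setting the tree for one vertex has a number of nodes exponential in its height, which is the redundancy that the construction trades for fewer communication phases.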
The network that accomplishes this state redistribution contains two components. First, in a fat-node ν at level i, the $\lambda_i$ roots of the emulation trees are connected to the roots of a number of broadcast trees, which touch all fat-nodes within a neighborhood of ν of radius h and serve the purpose of transporting the state updates for the guest vertices emulated in ν to those fat-nodes where such updates may be necessary to reinitialize the state of their emulation trees. Second, ν contains a network which connects the broadcast-tree nodes touching ν to the nodes of the emulation trees in ν. This latter network is a distributor, that is, a network of switches obtained by a simple adaptation of the Beneš permutation network [14]. The distributor can be programmed to connect each output (here, a node of an emulation tree) to at most one arbitrarily selected input (here, a node of a broadcast tree). As a consequence, a given input can be connected to zero or more outputs. In summary, each fat-node contains (i) a number of emulation trees, (ii) one distributor, and (iii) a set of broadcast-tree nodes. Each fat-node is connected to its parent and its two children by channels, each consisting of a collection of broadcast-tree edges (see Fig. 2).
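The input/output behaviour of a distributor, as described above, can be captured by a purely functional sketch. It models only the connection pattern (each output reads from at most one input, and an input may feed zero or more outputs), not the switch-level Beneš-based network; the function names `configure_distributor` and `distribute` are hypothetical.

```python
from typing import List, Optional, Sequence


def configure_distributor(requests: Sequence[Optional[int]],
                          num_inputs: int) -> List[Optional[int]]:
    """Validate an off-line configuration.

    requests[o] is the index of the input that output o listens to,
    or None if output o is left unconnected.
    """
    for output, source in enumerate(requests):
        if source is not None and not 0 <= source < num_inputs:
            raise ValueError(f"output {output} requests unknown input {source}")
    return list(requests)


def distribute(config: Sequence[Optional[int]], input_values: Sequence[str]) -> List[Optional[str]]:
    """Deliver to every connected output the value carried by its selected input."""
    return [None if source is None else input_values[source] for source in config]


# Input 0 fans out to two outputs, input 1 is unused, output 2 is unconnected.
config = configure_distributor([0, 0, None, 2], num_inputs=3)
delivered = distribute(config, ["a", "b", "c"])
```

Because the configuration is a map from outputs to inputs, fan-out comes for free while two inputs can never collide on the same output.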
For the emulation scheme to be feasible, guest vertices must be mapped to fat-nodes observing two constraints: (a) since each guest vertex is handled by a single emulation tree, there must be at least as many emulation trees in a fat-node as the number of guest vertices mapped to that fat-node; (b) since each broadcast tree spans a neighborhood of radius h, in order to guarantee that the state updates reach all the emulation trees where they are required, vertices whose distance is at most h/4 in the guest circuit must be mapped onto fat-nodes whose distance is at most h in the fat-tree. In order to establish such a mapping, we make use of a key result of layout theory, first established by Bhatt and Leighton in the bifurcator framework [15].
The rest of the paper is structured as follows. In Section 2, we review elements of bifurcator theory and emulation techniques that play a role in our construction. For added clarity, we describe the design of $U_A$ in two stages. First, following the ideas outlined above, in Section 3 we discuss the details of a simplified and slightly less efficient variant of $U_A$, where we assume that there is a distinct broadcast tree for each emulator tree root, and analyze its area requirement in Section 4. Then, in Section 5, we show how to reduce the number of broadcast trees by means of pipelining so as to obtain the stated area-time bounds for $U_A$. Finally, Section 6 discusses further ways of improving our design, while Section 7 offers a few concluding remarks.
Background
We will be dealing with graphs which can be laid out in Thompson's grid model for VLSI [16] , hence in what follows we will restrict our attention to graphs in which no vertex has degree greater than four. The construction of our universal circuits relies on two main ingredients: ''good'' embeddings of such graphs into binary trees and emulation techniques based on redundant computation. The following two subsections explore each of these aspects in detail.
Bifurcators and graph embeddings
Recall that an $(F_0, F_1, \ldots, F_{r-1})$-decomposition tree for a graph G is a binary tree T of height r whose nodes are subgraphs of G satisfying the following constraints:
(1) The root of T is G.
Recall that an embedding of a graph $G = (V_G, E_G)$ into a graph $H = (V_H, E_H)$ consists of a mapping ϕ of the vertices of G to the nodes of H and a mapping of the edges of G to the paths in H, such that any edge $(u, v) \in E_G$ is mapped to a simple path in H whose endpoints are ϕ(u) and ϕ(v). When H is a tree, the path corresponding to edge (u, v) is completely determined by ϕ as the unique simple path in H with endpoints ϕ(u) and ϕ(v).
For a given embedding, the load of a node $v_H \in V_H$ is the number of vertices of G mapped to $v_H$; the congestion of an edge $(u_H, v_H) \in E_H$ is the number of paths realizing edges of G that pass through $(u_H, v_H)$; and, finally, the dilation of an edge $(u_G, v_G) \in E_G$ is the length of the path in H corresponding to $(u_G, v_G)$. The maximum value of each of these quantities is called the load, congestion, and dilation of the embedding, respectively. The following theorem is an immediate consequence of Theorem 7 of [15] and highlights an important relation between graph bifurcators and tree embeddings:
The following is an immediate corollary of Fact 1 and Theorem 2.
Corollary 3. Any graph G with a layout of area A can be embedded into a complete binary tree of A leaves with dilation at most 4, load $\lambda_i \le c_\lambda \sqrt{A/2^i}$ at tree nodes at level i, and congestion $c_i \le c_c \sqrt{A/2^i}$ at tree edges connecting a node at level i with one of its children, for suitable fixed constants $c_\lambda, c_c > 0$.
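The load, congestion, and dilation of an embedding into a binary tree, as defined above, can be computed mechanically once the vertex map ϕ is given. The following is a minimal sketch under assumptions that are not in the text: tree nodes are heap-indexed (root 1, children 2k and 2k + 1), a tree edge is named by its child endpoint, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def tree_path_edges(a: int, b: int) -> List[int]:
    """Edges (named by their child endpoint) on the unique tree path from a to b."""
    edges = []
    while a != b:
        # With heap indexing the larger index is never closer to the root.
        if a > b:
            edges.append(a)
            a //= 2
        else:
            edges.append(b)
            b //= 2
    return edges


def embedding_metrics(guest_edges: Iterable[Tuple[int, int]],
                      phi: Dict[int, int]) -> Tuple[int, int, int]:
    """Return (load, congestion, dilation) of the embedding phi into the tree."""
    load = Counter(phi.values())
    congestion: Counter = Counter()
    dilation = 0
    for u, v in guest_edges:
        path = tree_path_edges(phi[u], phi[v])
        congestion.update(path)
        dilation = max(dilation, len(path))
    return max(load.values()), max(congestion.values(), default=0), dilation


# A 4-vertex guest path embedded into a tree with nodes 1..7.
guest_edges = [(0, 1), (1, 2), (2, 3)]
phi = {0: 4, 1: 2, 2: 5, 3: 3}
metrics = embedding_metrics(guest_edges, phi)
```

In this toy instance the edge (2, 3) is routed through the root, which is what drives both the dilation and the congestion on the tree edge above node 5.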
Emulations through redundant computation
In addition to the embedding technique reviewed in the previous subsection, our emulation crucially exploits the use of redundant computation, following an approach proposed in [13] for the emulation of arbitrary bounded-degree networks. Informally, we allow that a single vertex of G be emulated by a suitably large subset of nodes of H. During the emulation, some nodes in the subset may be ''left behind'', in the sense that their state is outdated with respect to the state of the most advanced replica in the subset. In order to ''refresh'' the state of lagging nodes, from time to time during the emulation all the necessary information is routed from the replicas that are up to date to those that are out of date. Constant slowdown can be achieved if the refreshing operations can be performed seldom enough that their cumulative cost is at most proportional to the number of steps of the computation being emulated.
Constant-slowdown emulations through redundant computations were first obtained by Meyer auf der Heide in [13]. Since our construction borrows some ideas from the scheme of [13], we recall the main features of such a scheme. The overall structure of $M_{c,N}$ is very simple: its basic constituents are a set of N c-ary trees of height t, and an $(N, N \cdot c^t)$-distributor. The latter network has N distinguished inputs and $N \cdot c^t$ outputs, coinciding, respectively, with the roots and the leaves of the N trees. By definition, a distributor can be prepared off-line to realize any communication pattern where each input sends the same message to a subset of outputs and each output receives exactly one message from one of the inputs. In [13], it is shown that a simple variation of the Beneš network [14] can be used as such a distributor.
The emulation algorithm proceeds in phases, each phase emulating t steps of G. At the beginning of Phase i, i = 0, 1, ..., each tree node contains the state at time $i \cdot t$ of its associated graph vertex. Then the emulation in each tree takes place, so that, after t steps, the N tree roots produce the sequence of state updates at times $i \cdot t + 1, \ldots, (i+1) \cdot t$ of the vertices in $V_G$. The distributor is configured in such a way that, for each $u \in V_G$, the root of $T_u$ can pipeline the sequence of t state updates that occurred during the phase to all the emulator tree nodes associated with u. In an additional t steps, any tree node is therefore able to compute the state of its associated graph vertex at time $(i+1) \cdot t$. Each phase requires time $O(tc + \log N)$. By choosing $t = \epsilon \log_c N$, we obtain $O(c + 1/\epsilon)$ slowdown and an overall number of $O(N^{1+\epsilon} \log N)$ nodes in $M_{c,N}$.
We remark that the need to transmit strings of t state updates arises when emulating processor networks, where the state of each vertex can be rather large. In our VLSI context, vertex states are essentially boolean values, making it sufficient to transmit the final value to each node associated with that vertex in any emulation tree.
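The slowdown and size figures quoted for the scheme of [13] follow from a short calculation, sketched here under the assumption that the tradeoff parameter is written $\epsilon$, that the phase length is $t = \epsilon \log_c N$, and that the degree c is treated as a constant in the final step.

```latex
\begin{align*}
\text{slowdown} &= \frac{O(tc + \log N)}{t} = O\!\left(c + \frac{\log N}{t}\right), \\
t = \epsilon \log_c N \;&\Longrightarrow\; \frac{\log N}{t} = \frac{\log c}{\epsilon},
  \qquad c^{t} = N^{\epsilon}, \\
\text{number of tree nodes} &= \Theta\!\left(N \cdot c^{t}\right) = \Theta\!\left(N^{1+\epsilon}\right).
\end{align*}
```

For constant c the ratio $\log c/\epsilon$ is $O(1/\epsilon)$. The additional $\log N$ factor in the stated node count is consistent with a distributor of logarithmic depth over the $N^{1+\epsilon}$ tree leaves, although that accounting is not spelled out here.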
Note that since a circuit of area A features at most A vertices of degree 4, $M_{4,A}$ can emulate any such circuit with constant slowdown. However, $M_{4,A}$ has a large area of at least $\Omega(A^{2(1+\epsilon)})$ [17], since it contains a Beneš permutation network [14] with $\Theta(A^{1+\epsilon})$ inputs as a subgraph. In the following sections we combine some ideas from [13] with the bifurcator-based properties of a circuit of area A, to obtain a universal circuit whose area is roughly the square root of the area of $M_{4,A}$.
Structure of the circuit and emulation algorithm
As mentioned in the previous section, the large area of $M_{4,A}$ is due to the presence of a full-blown distributor with $\Theta(A^{1+\epsilon})$ outputs. Besides having at most A nodes of degree at most 4, a circuit of area A exhibits additional structure, as captured by the embedding derived from the bifurcator in Corollary 3. This structure will let us replace the large distributor of $M_{4,A}$ with several much smaller ones, with a considerable reduction of the overall area requirement.
Following the bifurcator-induced embedding, the basic structure of our universal circuit $U_A$ is that of a complete binary tree of fat-nodes, of height log A. A fat-node at level i, 0 ≤ i ≤ log A, contains $\lambda_i$ emulator trees (i.e., 4-ary trees) of height h/4, where $\lambda_i$ is the upper bound on the value of the load at level i in the tree-embedding of an area-A circuit, and h is an integer parameter whose value will be chosen as a function of A and $\epsilon$ to govern the blowup-slowdown tradeoff of $U_A$, with larger h (that is, larger $\epsilon$) yielding a faster but larger universal circuit. The fat-node also contains a distributor of suitable size, which is used as in [13] to restore the current state in all the nodes of the emulator trees at certain times during the emulation.
Consider now an arbitrary circuit G of area A. The universal circuit is prepared for the emulation of G by assigning one emulator tree to each vertex of G. The assignment follows the embedding of G into the tree, in the sense that the emulator tree for vertex u is chosen among those residing in fat-node ϕ(u). Note that an emulator tree $T_u$ of $u \in V_G$ may contain nodes associated with vertices in $V_G$ whose emulator trees reside in fat-nodes different from ϕ(u). However, recall that the embedding provided by Corollary 3 has dilation at most 4. Since the vertices in $V_G$ associated with nodes of $T_u$ have distance at most h/4 from u in G, it follows that their corresponding emulator trees reside in fat-nodes at distance at most h from ϕ(u) in the fat-tree underlying the universal circuit. We call this set of fat-nodes an h-neighborhood of ϕ(u). Clearly, the roots of the emulator trees within an h-neighborhood of ϕ(u) are the only ones that need to be connected to the nodes of the emulator trees in ϕ(u) through the local distributor. The following lemmata quantify the number of distinct emulator trees in an h-neighborhood of a fat-node as a function of its level in the tree.
Lemma 4. Let $L_{i,j}$ be the number of vertices of G embedded within a subtree of height j rooted at a fat-node at level i, with
Proof. The load of a fat-node at level i + s in the tree is at most $\lambda_{i+s} = c_\lambda \sqrt{A/2^{i+s}}$. Therefore
Lemma 5. Let $N_i$ be the total number of vertices of G embedded in fat-nodes of the h-neighborhood of a fat-node at level i, with
Proof. Consider a fat-node ν at level i, with 0 ≤ i ≤ log A. Its h-neighborhood includes the two subtrees of height min{log A − i, h − 1} rooted at its children (if any), the first min{i, h} + 1 fat-nodes on the path from ν to the fat-root and, for the s-th such node, the fat-nodes of a subtree of height h − s − 1 (see Fig. 3). Therefore,
In order to restore the correct state in all the nodes of the emulator trees within fat-node ν, all the roots of the emulator trees in the h-neighborhood of ν must be connected to the inputs of the local distributor, while all the nodes of the emulator trees in the fat-node must be connected to its outputs. Prior to the emulation, the distributor in each fat-node is prepared so that the root of each emulator tree $T_u$ residing in the h-neighborhood can broadcast the sequence of state updates in the current phase to all the nodes associated with vertex u.
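The size of an h-neighborhood can also be explored numerically. The sketch below is an illustration under stated assumptions, not the lemmas' proof: the fat-tree is heap-indexed (root 1), the per-level load is taken to be exactly $c\sqrt{A/2^{\ell}}$ at level $\ell$ with a hypothetical constant `c = 1`, and the function names are invented for the example. It enumerates by breadth-first search all fat-nodes within tree distance h of a fat-node at a given level and adds up their loads.

```python
import math
from collections import deque
from typing import List


def neighborhood_levels(level: int, h: int, log_a: int) -> List[int]:
    """Levels of all fat-nodes within tree distance h of one fat-node at `level`."""
    start = 1 << level  # leftmost node of that level; by symmetry any node would do
    seen = {start}
    queue = deque([(start, 0)])
    levels = []
    while queue:
        node, dist = queue.popleft()
        node_level = node.bit_length() - 1
        levels.append(node_level)
        if dist == h:
            continue
        neighbors = []
        if node > 1:
            neighbors.append(node // 2)                 # parent
        if node_level < log_a:
            neighbors.extend([2 * node, 2 * node + 1])  # children
        for nb in neighbors:
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return levels


def neighborhood_load(level: int, h: int, log_a: int, c: float = 1.0) -> float:
    """Total assumed load c * sqrt(A / 2^l) over the h-neighborhood, with A = 2^log_a."""
    area = 2 ** log_a
    return sum(c * math.sqrt(area / 2 ** l) for l in neighborhood_levels(level, h, log_a))


root_load = neighborhood_load(level=0, h=0, log_a=4)
root_radius_one = neighborhood_load(level=0, h=1, log_a=4)
inner_levels = sorted(neighborhood_levels(level=2, h=2, log_a=4))
```

Evaluating `neighborhood_load` for growing h at a fixed level gives a quick feel for how fast the number of distributor inputs grows with the neighborhood radius.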
Taking a symmetric perspective, every root of an emulator tree in ν must be connected to all the distributors in the h-neighborhood of ν. For the sake of simplicity, let us assume for now that such connections are realized by having each root of an emulator tree be also the root of a dedicated broadcast tree that spans the h-neighborhood, and whose nodes are connected to the distributors. Note that there are $\lambda_i$ distinct broadcast trees rooted at each fat-node at level i. In Section 5, we will discuss how to employ pipelining to reduce the number of broadcast trees rooted at a fat-node, which will improve the blowup by a polylogarithmic factor in A.
From the above observations it follows that the distributor in a fat-node ν at level i has
Since $N_i$ is a factor Θ(h) greater than $b_i$ (although no more than $b_i$ inputs will ever be active in any emulation), an $(N_i, N_i)$-distributor is needed to realize the required multicast operation.
Once the universal circuit is prepared for G, the emulation proceeds similarly to the network emulation of [13]. Specifically, each phase emulates h/4 steps of G as follows. First, in each fat-node, the emulator trees compute the initial state of their roots for the next phase. Then, a root associated with vertex $u \in V_G$ broadcasts its state to all the nodes of its broadcast tree, and from there through the distributors to all the nodes of the emulator trees associated with vertex u in the h-neighborhood of ϕ(u). At this point, all the nodes of the emulator trees are ready for the next phase. Altogether, the overall time to emulate h/4 steps is O(h + log A). When $h = (\epsilon/2)\log A$, we obtain a slowdown of $O(1/\epsilon)$. (The factor 1/2 in the definition of h may appear redundant in Section 4, whereas it does play a role in the construction of Section 5.)
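The slowdown claimed for a phase can be checked with a one-line calculation, sketched here with the tradeoff parameter written $\epsilon$ and assuming $\epsilon \le 1$ as in the stated range of the parameter.

```latex
\begin{align*}
\sigma &= \frac{O(h + \log A)}{h/4} = O\!\left(1 + \frac{\log A}{h}\right), \\
h = \frac{\epsilon}{2}\log A \;&\Longrightarrow\; \frac{\log A}{h} = \frac{2}{\epsilon},
  \qquad \sigma = O\!\left(1 + \frac{2}{\epsilon}\right) = O\!\left(\frac{1}{\epsilon}\right).
\end{align*}
```

The additive $\log A$ term, which accounts for the redistribution, is therefore amortized over the $h/4$ guest steps of the phase.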
Area of the universal circuit
Let us first consider the layout of a fat-node ν at level i. The area of the layout is dominated by the area of the $N_i$-input Beneš network realizing the local distributor. Also, no more than $N_i$ broadcast trees are incident on ν. Hence ν admits a square layout of area $O(N_i^2)$, where the three communication channels needed to route the broadcast trees between ν and its parent and its two children are incident on three distinct sides of the layout. The high-level organization of the resulting layout is shown in Fig. 2.
Being tree-structured, the circuit admits an H-tree layout [17] where the nodes of the H-tree are the layouts of the corresponding fat-nodes, and the wire channels between pairs of nodes are wide enough to route all the edges of the broadcast trees traversing the channel. The recursive structure is shown in Fig. 1 .
Let S(A) be the side length of the resulting H-layout. From the above considerations we conclude that
hence the overall area requirement of the resulting design is $O(\epsilon^2 A^{1+\epsilon/2} \log^4 A)$. In the next section, we show how to reduce the area bound of $U_A$ by reducing the number of incident broadcast trees and the size of the distributor at each fat-node.
Improving the area bound through pipelining
The careful reader has probably observed that both the broadcast trees and the distributors of the construction presented in Section 3 are somewhat underutilized. In fact, only one stage of them is active at any given time, which is the natural scenario for improving performance via pipelining.
The idea behind the use of pipelining is quite simple. Specifically, we partition the $\lambda_i$ roots of the emulator trees of a fat-node at level i into $\lambda_i/h$ groups of (at most) h roots each and let each group use a single broadcast tree to distribute state updates to their h-neighborhood. Simple calculations suffice to show that the number of distinct broadcast trees incident on a node at level i is now
} , hence the overall area requirement of our final universal circuit
whenever the parameter $\epsilon$ is such that $4\log\log A/\log A \le \epsilon \le 1$.
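The effect of letting a group of roots share one broadcast tree can be illustrated with a generic pipelining sketch. It is a toy model under assumptions that are not in the text: one message enters the shared tree per step, every message advances one level per step, and a message is delivered when it reaches the given depth. The function names are hypothetical.

```python
import math
from typing import Dict


def num_broadcast_trees(num_roots: int, group_size: int) -> int:
    """Broadcast trees needed when up to `group_size` roots share one tree."""
    return math.ceil(num_roots / group_size)


def pipelined_broadcast_steps(num_messages: int, depth: int) -> int:
    """Steps until all messages, injected one per step, have travelled `depth` levels."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    position: Dict[int, int] = {}  # message id -> level reached so far
    injected = 0
    delivered = 0
    steps = 0
    while delivered < num_messages:
        steps += 1
        if injected < num_messages:
            position[injected] = 0  # a new message enters at the root of the tree
            injected += 1
        for message in list(position):
            position[message] += 1  # every message in flight advances one level
            if position[message] == depth:
                delivered += 1
                del position[message]
    return steps


shared = pipelined_broadcast_steps(num_messages=4, depth=8)
one_at_a_time = 4 * pipelined_broadcast_steps(num_messages=1, depth=8)
trees = num_broadcast_trees(num_roots=10, group_size=4)
```

In this model a group of g messages finishes in g + depth − 1 steps instead of g · depth, so sharing a tree among about h roots costs only an additive O(h) in time while dividing the number of trees by about h.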
We have thus proved the main result of this paper, stated in the following theorem.
The above construction is valid even for nonconstant values of the parameter $\epsilon$. For instance, setting $\epsilon = 4\log\log A/\log A$ yields a circuit of area $O(A \log^2 A)$, which can emulate all circuits of area A with $O(\log A/\log\log A)$ slowdown. This area is the same as that of the concentrator fat-tree of [1], while the slowdown is a $\Theta(\log\log A)$ factor smaller.
Further directions of improvement
It is straightforward to check that, in terms of (blowup, slowdown) bounds, most previous universal circuits (e.g., [1, 2, 12, 3] ) are subsumed by some instantiation of our circuit U A . However, our flexible design is not able to match the constant blowup and logarithmic slowdown networks of [5] (word model) and [6] (bit model), since the circuit of Theorem 6 does not admit a realization with constant blowup.
It is natural to wonder whether our construction can be further improved to yield a suitably parametrized universal circuit, whose (blowup, slowdown) bounds can smoothly range from
). In fact, there are still a number of enhancements that can be applied to our construction, which yield extra area savings and push down the minimum value of $\epsilon$ for which significant tradeoffs can be achieved to O(1/log A). However, since the improved design still falls short of subsuming the circuit in [6], here we discuss the necessary modifications only at a high level, and leave the details to the interested reader.
In the first place, observe that the pipelining feature introduced in Section 5 is underutilized whenever h = o(log A) (i.e., for nonconstant values of $\epsilon$), since the cost of distribution at the root of the fat-tree is always Ω(log A). Therefore, we can afford to let Θ(log A) (rather than Θ(h)) roots of emulator trees use the same broadcast tree. Also, observe that the h-neighborhood of fat-nodes grows exponentially smaller as we move towards the leaves, hence we can have shallower emulation trees for vertices embedded closer to the leaves, at the cost of more frequent redistributions, whose total cost would still be amortized by the number of steps being emulated. To avoid wasting area towards the leaves of the H-layout, we must stop the embedding stated in Corollary 3 at a level i where the total number of vertices that would be embedded in a subtree rooted at level i is about $\log^2 A$, and simply map all these vertices to the fat-nodes at level i. A simple mesh layout is then sufficient for these fat-nodes.
To quantify the impact of the above changes, simple calculations show that for $\epsilon = 1/\log A$ the resulting universal circuit features an area of O(A log log A) with slowdown O(log A). The blowup is a mere factor O(log log A) worse than the one achieved by the circuit of [6] with the same slowdown.
Further improvements might possibly be obtained by blending the techniques of [6] with those of the current paper, but the required adaptations do not appear to be straightforward. In fact, in [6] the area-A layout is partitioned into log A×log A cells, whose internal communications are emulated by small meshes of the universal circuit and whose external communications are emulated by a concentrator fat-tree that has the meshes placed at its leaves. To achieve locality within the small meshes, it is crucial that all the vertices of the emulated circuit be embedded there, at the leaves of the fat-tree. This is in contrast with the approach of the current paper, where some vertices of the emulated circuit are also embedded in the internal fat-nodes, to achieve locality (constant dilation) within the fat-tree.
Conclusion
We have developed the first efficient area-universal circuit with constant slowdown. In fact, our construction is more general, since it embodies a design parameter $\epsilon$ and yields slowdown $\sigma = O(1/\epsilon)$ and area blowup nearly proportional to $A^{\epsilon}$, namely $\alpha = O(A^{\epsilon})$, within the range $4\log\log A/\log A \le \epsilon \le 1$. Therefore, the area blowup is exponential in the inverse of the slowdown.
More substantial reductions of the area of the universal circuit, possibly leading to polylogarithmic blowup for constant slowdown, pose a considerably more challenging problem, which appears to require significant progress in our current understanding of constant slowdown simulations of all bounded degree networks.
The ideas developed here might find, most likely after careful fine-tuning, applications to the design of versatile and/or reconfigurable hardware architectures. In fact, the potential to simulate any architecture of a given area guarantees that a reconfigurable architecture with universal properties can execute any task nearly as efficiently as any circuit specialized for that task. A particularly interesting direction to explore in this context is the possibility of developing automatic tools to efficiently map any given application onto an area-universal circuit.
