In this paper we introduce a class of trees, called generalized compressed trees. Generalized compressed trees can be derived from complete binary trees by performing certain `contraction' operations. A generalized compressed tree CT of height h has approximately 25% fewer nodes than a complete binary tree T of height h. We show that these trees have smaller (up to a 74% reduction) 2-dimensional and 3-dimensional VLSI layouts than complete binary trees. We also show that algorithms initially designed for T can be simulated by CT with at most a constant slowdown. In particular, algorithms that have a non-pipelined computation structure and were originally designed for T can be simulated by CT with no slowdown.
Introduction
Parallel machines interconnecting up to thousands of nodes have been proposed and recently built. One of the earliest and most prominent interconnection networks is the complete binary tree [2, 5]. Many algorithms can be naturally programmed on complete binary trees (e.g., algorithms using a divide-and-conquer strategy), and these networks arise in many applications [1, 2, 5, 8, 13]. In some situations not all the nodes of a complete binary tree machine are of the same type; i.e., some nodes may have more memory and/or processing power than others. In particular, the leaf nodes of a complete binary tree may do the actual processing, while the interior nodes may simply be used as switches [5]. In fact, a number of applications on complete binary trees use the interior nodes for broadcasting or routing data during most of their execution steps, while the leaf nodes and the root node perform I/O (Input/Output) in addition to computations. Efficient VLSI implementations of complete binary trees have also been extensively studied (e.g., see [11, 13]). One of the fundamental goals in VLSI implementation is to obtain a very compact VLSI chip for an interconnection network. It is well known that an n-leaf complete binary tree has a 2-dimensional VLSI layout of area O(n) and a 3-dimensional VLSI layout of volume O(n). This leads to the following question: Do there exist other types of n-leaf tree networks that support a divide-and-conquer communication structure as well as a complete binary tree does, but have more compact VLSI layouts? We attack this question by proposing a family of trees called generalized compressed trees.
Generalized compressed trees may, in general, be viewed as a derivative of complete binary tree networks. Intuitively, an n-leaf type k generalized compressed tree is obtained by `merging' at most k non-leaf nodes of an n-leaf complete binary tree into a single non-leaf node of the generalized compressed tree. Hence, they have the same number of leaf nodes as complete binary trees but approximately 25% fewer nodes in total. We show that generalized compressed trees have better (up to a 74% reduction) 2-dimensional and 3-dimensional VLSI layouts than complete binary trees. Furthermore, we show that many parallel algorithms designed for complete binary trees (e.g., algorithms with a non-pipelined computation structure) can be easily implemented on generalized compressed trees with no loss in execution time. The slowdown is at most 8 if arbitrary algorithms for complete binary trees are simulated on compressed trees. One limitation of compressed trees, however, is that they have small bisection bandwidth (a property they share with trees in general).
Lo et al. [10] and Zheng [14] have also proposed variants of complete binary trees. The basic idea behind their variants is to `merge' 2 or log n nodes of an n-leaf complete binary tree. Furthermore, these variants have good algorithmic properties and compact 2-dimensional VLSI layouts. Our generalized compressed trees include the compressed trees of [14] and the binomial tree machines of [10] as special cases. In fact, complete binary trees are also a special case of the generalized compressed trees. To the best of our knowledge, efficient VLSI layouts of binomial trees have not been studied, and only 2-dimensional VLSI layouts of compressed trees have been studied [14] (in our formulation, compressed trees are similar to type 2 generalized compressed trees). In this paper, we study 2-dimensional (2-d) as well as 3-dimensional (3-d) VLSI layouts of generalized compressed trees. As stated earlier, we also investigate efficient simulations, on generalized compressed trees, of algorithms that were originally designed for complete binary trees.
The rest of the paper is organized as follows. In Section 2, we formally define the generalized compressed trees and give definitions relevant to the paper. Section 3 discusses 2-d layouts of type 2 generalized compressed trees. We show that type 2 compressed trees have a 2-d layout that uses only (approximately) 56% of the area of the most compact 2-d layout of a complete binary tree. Section 3 also discusses 3-d layouts of type 3 and type 4 generalized compressed trees. We show that type 3 (respectively type 4) compressed trees use only 28% (respectively 26%) of the volume of the most compact 3-d layout of a complete binary tree. In Section 4, we consider efficient simulations, on type k generalized compressed trees, of algorithms originally designed for complete binary trees.
Generalized Compressed Trees
In this section we formally define the generalized compressed tree and give the relevant definitions. Let $G$ and $H$ be two trees rooted at nodes $r_g$ and $r_h$, respectively. The number of nodes in $G$ is designated by $P(G)$ and the number of edges by $E(G)$. The combine operation of $G$ and $H$, denoted $G \oplus H$, is defined to be the tree in which $r_h$ becomes the rightmost child of $r_g$. Note that $G \oplus H$ contains $P(G) + P(H)$ nodes and $E(G) + E(H) + 1$ edges. The leaf expand operation of $G$ and $H$, denoted $G \otimes H$, is defined to be the tree in which every leaf $l$ of $G$ is expanded by a copy $C$ of $H$ such that the parent of $l$ in $G$ is now the parent of the root of $C$. Let $n$ be the number of leaves in $G$. Then the number of nodes in $G \otimes H$ is $P(G) - n + n P(H)$ and the number of edges is $E(G) + n E(H)$. Note that $H \oplus G$ is a different tree from $G \oplus H$. Similarly, $G \otimes H \neq H \otimes G$. That is, the operations $\oplus$ and $\otimes$ are not commutative. An example of combining and leaf expansion of a 7-node tree $T_1$ and a 4-node tree $T_2$ is shown in Figure 1.
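The two operations defined above can be checked on small trees with a short sketch. The `Node` class, the function names, and the two example trees are assumptions made for illustration (the shapes of the trees in Figure 1 are not reproduced here); a single-node tree is treated as a leaf, so leaf-expanding it simply yields a copy of the second tree.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A rooted ordered tree; a node with no children is a leaf."""
    children: List["Node"] = field(default_factory=list)


def clone(t: Node) -> Node:
    return Node([clone(c) for c in t.children])


def num_nodes(t: Node) -> int:
    return 1 + sum(num_nodes(c) for c in t.children)


def num_edges(t: Node) -> int:
    return num_nodes(t) - 1


def num_leaves(t: Node) -> int:
    return 1 if not t.children else sum(num_leaves(c) for c in t.children)


def combine(g: Node, h: Node) -> Node:
    """Combine: the root of h becomes the rightmost child of the root of g."""
    result = clone(g)
    result.children.append(clone(h))
    return result


def leaf_expand(g: Node, h: Node) -> Node:
    """Leaf expand: every leaf of g is replaced by a fresh copy of h."""
    if not g.children:
        return clone(h)
    return Node([leaf_expand(c, h) for c in g.children])


def complete_binary(height: int) -> Node:
    if height == 0:
        return Node()
    return Node([complete_binary(height - 1), complete_binary(height - 1)])


# Hypothetical example trees: a 7-node complete binary tree and a 4-node star.
t1 = complete_binary(2)
t2 = Node([Node(), Node(), Node()])

c = combine(t1, t2)
e = leaf_expand(t1, t2)
n = num_leaves(t1)

# Node and edge counts stated in the text.
assert num_nodes(c) == num_nodes(t1) + num_nodes(t2)
assert num_edges(c) == num_edges(t1) + num_edges(t2) + 1
assert num_nodes(e) == num_nodes(t1) - n + n * num_nodes(t2)
assert num_edges(e) == num_edges(t1) + n * num_edges(t2)

# Neither operation is commutative on this pair.
assert combine(t1, t2) != combine(t2, t1)
assert num_nodes(leaf_expand(t1, t2)) != num_nodes(leaf_expand(t2, t1))
```

Both operations copy their arguments, so the same tree can safely appear on both sides of an operation.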
We now define generalized compressed trees. Let $CT_k(h)$ be the $k$th type of compressed tree of height $h$, $1 \leq k \leq h$. Tree $CT_k(h)$ is defined inductively as follows. Let
An example of $CT_3(4)$ is given in Figure 2. Given a complete binary tree $T(h)$, the generalized compressed tree $CT_k(h)$ can also be obtained by `compressing' certain nodes of $T(h)$. As an example let us see how we can obtain CT (4). Similarly, a general strategy may be given to form $CT_k(h)$ from $T(h)$, but we omit it from this paper since the definition given in equation (1) is sufficient for our purposes.
It is easy to see that the number of leaves in $CT_k(h)$ and $T(h)$ is the same, i.e., they both have $2^h$ leaves. However, the number of interior nodes in $CT_k(h)$ is much smaller than the number in $T(h)$, as shown below. Let an interior node of $CT_k(h)$ be called a compressed node if the degree of the node is greater than 3 in $CT_k(h)$. Note that the root of $CT_k(h)$ will always be a compressed node. In order to label the nodes of $CT_k(h)$, we can use a labeling scheme similar to the one for complete binary trees. Label the root of $CT_k(h)$ as $\epsilon$, the empty string, and label the $(k+1)$ children of $\epsilon$ as $1, 2, \ldots, (k+1)$. In general, if a node $v$ at level $l$ of $CT_k(h)$ has label $B = b_1 b_2 \ldots b_l$ and has $t \geq 2$ children, then the children have labels $B1, B2, \ldots, Bt$, for $1 \leq l \leq h - 1$.
The number of nodes in $CT_k(h)$ is asymptotically 75% of the number of nodes in $T(h)$. In order to see this, consider the definition of a generalized compressed tree from equation (1). We have $P(CT_k(h)) = P(CT_k(mk + r))$. Since
Using equation (1) with $mk = (m - 1)$
It is now easy to see that for large $h$ and small $r$ (in comparison to $h$) the ratio $P(CT_k(h))/P(T(h))$ is asymptotically $3/4$. We note that the best ratio is achieved when $r = 0$; hence, for practical reasons, one may consider generalized compressed trees only for the case $h = mk$.
VLSI Layouts
In this section, we compare the 2-d and 3-d layouts of the compressed tree $CT_k(h)$ with those of the binary tree $T(h)$. We show that the 2-d layout of $CT_2(h)$ has a smaller area than the 2-d layout of the tree $T(h)$. Furthermore, the compressed trees $CT_3(h)$ and $CT_4(h)$ have a smaller volume in 3-d layouts than the 3-d layout of the tree $T(h)$. In fact, the reduction in the area (resp. volume) amounts to approximately 44% (resp. 72% and 74%). Hence, our results imply that compressed trees can be implemented using smaller VLSI chips and can be used instead of complete binary trees.
A commonly used model for laying out VLSI circuits (e.g., [13]) is to view the circuit as a bounded degree graph $G$ in which the nodes correspond to processing elements and the edges correspond to wires. Graph $G$ is then embedded in a two-dimensional or three-dimensional grid subject to the following assumptions and constraints:
1. Each node occupies unit area. Distinct nodes of the graph are embedded at distinct grid intersection points.
2. Edges have unit width and are routed along grid lines with the restriction that no two edges overlap except possibly when crossing at right angles or when bending (i.e., to form `knock-knees'). Also, an edge cannot be routed over a node it does not connect.
The area of a two-dimensional layout is defined as the area of the ``bounding rectangle,'' and it equals the product of the number of vertical tracks and the number of horizontal tracks that contain a node or a wire segment of the graph $G$. The volume of a three-dimensional layout is defined similarly, and equals the product of the number of horizontal tracks, the number of vertical tracks, and the number of tracks in the third dimension. Within three-dimensional layouts, two models, namely the One-Plane-Active model and the All-Plane-Active model, have received attention [7, 11]. In the first model only the grid intersection points on one of the boundary planes are allowed to contain nodes, while in the second model every grid intersection point can contain a node.
We assign the nodes $\epsilon$ (the root), 1, 2, 3, 31, and 32 to the grid points as shown in Figure 3. We then recursively lay out the subtrees. We thus have the following result:
Theorem 1 Any type 2 $n$-node compressed tree $CT_2(h)$ can be embedded into a 2-dimensional grid having area $\frac{10}{9} n + o(n)$.
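The area and volume measures of the layout model can be made concrete with a small sketch. It assumes a layout is given as a list of node positions plus, for each wire, the list of grid points the wire passes through; this representation and the name `layout_size` are hypothetical and chosen only for illustration. The function counts, in each dimension, the tracks that contain a node or a wire segment and multiplies the counts, so it yields the area for 2-d points and the volume for 3-d points.

```python
from math import prod
from typing import Iterable, Sequence, Tuple

Point = Tuple[int, ...]


def layout_size(nodes: Iterable[Point], wires: Iterable[Sequence[Point]]) -> int:
    """Product, over all dimensions, of the number of tracks that are in use."""
    used = set(nodes)
    for path in wires:
        used.update(path)  # every grid point a wire visits occupies its tracks
    if not used:
        return 0
    dims = len(next(iter(used)))
    return prod(len({p[d] for p in used}) for d in range(dims))


# A 3-node tree: root at (1, 1), leaves at (0, 0) and (2, 0), wires bending once.
nodes_2d = [(1, 1), (0, 0), (2, 0)]
wires_2d = [[(1, 1), (0, 1), (0, 0)], [(1, 1), (2, 1), (2, 0)]]
area = layout_size(nodes_2d, wires_2d)  # 3 vertical tracks times 2 horizontal tracks

# Two nodes stacked in the third dimension and joined by one wire.
volume = layout_size([(0, 0, 0), (0, 0, 1)], [[(0, 0, 0), (0, 0, 1)]])
```

The sketch does not check the routing constraints of the model (distinct node positions, no overlapping edges); it only evaluates the cost measure for a layout that is assumed to be valid.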
We know that one of the most compact 2-d layouts of a complete binary tree $T(h)$ of height $h$ has area $A(T(h)) = 4 \cdot 2^h - 4 \cdots$
For $h \geq 2k$, the maximum degree of the nodes in a compressed tree $CT_k(h)$ is $k + 2$, and hence for $k > 2$ there would be a `wastage' of tracks in an efficient 2-d layout.
We thus next investigate 3-dimensional layouts of compressed trees. Since the degree of grid points in a 3-d grid is 6, one may hope to obtain very compact 3-d layouts of
From the results of Corollaries 2, 4, and 6, we see that there is a considerable reduction in the VLSI layout area and volume of specific compressed trees. This reduction is important when compressed trees are to be realized in VLSI. We next show that compressed trees are computationally equivalent to complete binary trees for a class of algorithms that have no `pipelining' embedded in them. More specifically, an algorithm that is designed for a complete binary tree $T(h)$ and in which computation is performed level by level at any given time can be easily simulated on any type $k$ compressed tree with no loss in its execution time.
Algorithms on Generalized Compressed Trees
In this section, we investigate programmability aspects of the generalized compressed trees. Let $\mathcal{A}$ be the class of algorithms in which the computation is performed level by level on a complete binary tree. Many fundamental problems have a solution in class $\mathcal{A}$ (e.g., broadcasting, or computing reduction functions such as min and max [1, 9, 13]). We show that any algorithm from class $\mathcal{A}$ can be simulated on a compressed tree with no slowdown. In [6], we considered an example of a class $\mathcal{A}$ algorithm, namely the parallel prefix computation [1, 9, 13], and showed that an algorithm that computes parallel prefix on a complete binary tree $T(h)$ can be easily converted to run on a compressed tree $CT_k(h)$ with the same time and space complexity. Here, we outline the main ideas behind the implementation of any class $\mathcal{A}$ algorithm on $CT_k(h)$.
Let $A$ be an algorithm from class $\mathcal{A}$. Assume that $A$ was originally designed for an $n$-leaf complete binary tree $T(h)$. At any given time step $t$ during the execution of $A$ on $T(h)$, the nodes of $T(h)$ perform one of the following two tasks: either (1) the nodes on a (pre-specified) level $i$ of $T(h)$ perform some computation, or (2) the nodes on level $i$ of $T(h)$ communicate with their children on level $i + 1$ or with their parents on level $i - 1$. Without loss of generality, we may assume that each of these tasks takes 1 unit of time on $T(h)$. In order to implement $A$ on $CT_k(h)$, we simply need to show how to simulate these two tasks. Further, if each of these tasks also takes 1 unit of time on $CT_k(h)$, then the simulation of $A$ on $CT_k(h)$ has the same time complexity as $A$ on $T(h)$; i.e., the simulation incurs no slowdown. The key idea behind the simulation is that, for the nodes of $CT_k(h)$ that have more than two children (i.e., compressed nodes), the inputs from these children arrive at staggered times and hence cause no inefficient serialization.
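As an illustration of the level-by-level structure described above, the following sketch computes a max reduction on an n-leaf complete binary tree and counts time steps, charging one unit for each communication task and one unit for each computation task. The function name and the list-based representation of a tree level are assumptions made for this example; the sketch models only the complete binary tree, not the simulation on a compressed tree.

```python
from typing import List, Tuple


def level_by_level_max(leaf_values: List[int]) -> Tuple[int, int]:
    """Return (maximum, time steps) for a bottom-up reduction on T(h)."""
    n = len(leaf_values)
    if n == 0 or n & (n - 1):
        raise ValueError("the number of leaves must be a power of two")
    level = list(leaf_values)  # values held by the nodes of the current level
    steps = 0
    while len(level) > 1:
        # Task (2): the nodes on this level send their values to their parents.
        steps += 1
        # Task (1): the parents, one level up, combine their two inputs.
        level = [max(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        steps += 1
    return level[0], steps


result, steps = level_by_level_max([3, 1, 4, 1, 5, 9, 2, 6])
```

For 8 leaves (height 3) the loop runs once per level, so the sketch charges two time units per level; only one level of the tree is active in any time step, which is the property the simulation argument relies on.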
Let $T'$ (resp. $T''$) be the left subtree (resp. right subtree) of height $h - 1$ rooted at node 1 (resp. node 2) of $T(h)$. Let $h = mk + r$ as before. We know that we can partition $CT_k(h)$ into two trees $CT'$ and $CT''$ such that $CT_k(h) = CT' \oplus CT''$, where $CT' = CT'' = CT_{k-1}(k-1) \otimes \left( \bigotimes_{i=1}^{m-1} CT_k(k) \right) \otimes T(r)$; here $\oplus$ denotes the combine operation and $\otimes$ the leaf expand operation of Section 2. Intuitively, tree $CT'$ is composed of the root of $CT_k(h)$ and the subtrees rooted at nodes $1, 2, \ldots, k$ of $CT_k(h)$. Tree $CT''$ is the subtree rooted at node $k + 1$ of $CT_k(h)$. By the definition of $CT'$, $CT''$ and $CT_k(h)$, it is easy to see that the trees $CT'$ and $CT''$ have the same structure and that $CT''$ is ``lowered'' by one level (in comparison to $CT'$) in the compressed tree $CT_k(h)$. In fact, trees $CT'$ and $CT''$ can also be recursively partitioned (in the same fashion) into smaller and smaller trees that have the same structure.
If we inductively let tree $CT'$ simulate the computations of the left subtree $T'$ of $T(h)$ and let tree $CT''$ simulate the computations of the right subtree $T''$ of $T(h)$, then our task (namely, the simulation of algorithm $A$ on $CT_k(h)$) reduces to simulating either the computation, say $C$, at the root of $T(h)$ or the communication between the root and nodes 1 and 2 of $T(h)$. Observe that, under this scheme, the computation at node 1 (resp. node 2) of $T(h)$ is simulated by the root (resp. node $k + 1$) of $CT_k(h)$. We can simulate the computation $C$ at the root of $CT_k(h)$ with the same time complexity as at the root of $T(h)$ because, in algorithm $A$, the root of $T(h)$ and node 1 never perform a computation in the same time unit. We can simulate the communication between the root and node 2 of $T(h)$ by the communication between the root and node $k + 1$ of $CT_k(h)$. The communication between the root and node 1 of $T(h)$ can simply take place between the root of $CT_k(h)$ and itself (possibly through some assignment statements) because the root of $CT_k(h)$ simulates both the root and node 1 of $T(h)$. Hence, there is no delay in the communication. By the induction hypothesis, trees $CT'$ and $CT''$ simulate the tasks that algorithm $A$ performs in the left and right subtrees $T'$ and $T''$, respectively, with no slowdown. This completes the simulation of algorithm $A$ on $CT_k(h)$ with no slowdown.
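The recursive partition used in the argument above also gives a direct way to construct these trees, which the following sketch uses to check three properties stated in the text: the tree has 2^h leaves, its root has k + 1 children, and for h at least 2k its maximum degree is k + 2. The sketch reads the partition as "CT' and CT'' are each the (k-1)-type tree of height k-1, leaf-expanded m-1 times by the type k tree of height k and then once by T(r), and the two are joined by the combine operation". It additionally assumes, as a base case not confirmed by the text shown here, that a type 1 compressed tree is the complete binary tree. All names are hypothetical.

```python
from typing import List, Optional


class Node:
    def __init__(self, children: Optional[List["Node"]] = None) -> None:
        self.children: List["Node"] = children or []


def clone(t: Node) -> Node:
    return Node([clone(c) for c in t.children])


def combine(g: Node, h: Node) -> Node:
    """The root of h becomes the rightmost child of the root of g."""
    return Node(clone(g).children + [clone(h)])


def leaf_expand(g: Node, h: Node) -> Node:
    """Every leaf of g is replaced by a fresh copy of h."""
    if not g.children:
        return clone(h)
    return Node([leaf_expand(c, h) for c in g.children])


def complete_binary(height: int) -> Node:
    if height == 0:
        return Node()
    return Node([complete_binary(height - 1), complete_binary(height - 1)])


def compressed_tree(k: int, h: int) -> Node:
    """Type k compressed tree of height h = m*k + r, for 1 <= k <= h."""
    if not 1 <= k <= h:
        raise ValueError("need 1 <= k <= h")
    if k == 1:
        return complete_binary(h)  # assumed base case
    m, r = divmod(h, k)
    part = compressed_tree(k - 1, k - 1)
    for _ in range(m - 1):
        part = leaf_expand(part, compressed_tree(k, k))
    part = leaf_expand(part, complete_binary(r))  # no change when r == 0
    return combine(part, part)  # CT' combined with CT''


def num_leaves(t: Node) -> int:
    return 1 if not t.children else sum(num_leaves(c) for c in t.children)


def max_degree(t: Node, is_root: bool = True) -> int:
    own = len(t.children) + (0 if is_root else 1)
    return max([own] + [max_degree(c, False) for c in t.children])


for k in range(1, 5):
    for h in range(k, 9):
        tree = compressed_tree(k, h)
        assert num_leaves(tree) == 2 ** h
        assert len(tree.children) == k + 1
        if h >= 2 * k:
            assert max_degree(tree) == k + 2
```

Under these assumptions the construction agrees with the stated leaf count, root fan-out, and maximum degree for all tested values of k and h, which supports the reading of the partition but does not replace the formal definition.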
Above, we showed that compressed trees have the same power as complete binary trees, at least as far as non-pipelined divide-and-conquer computations are concerned. The relationship between the complete binary tree T and the compressed tree CT remains to be investigated for the case in which an arbitrary algorithm for T is to be simulated on CT. We can attack this question by formulating the simulation problem as a graph embedding problem. In graph embedding problems, tree T is embedded into CT so that the cost measures dilation, congestion, and node-utilization are minimized [3, 4, 12]. In [6], we present an efficient embedding of T into CT that achieves a dilation, congestion, and node-utilization of two. This implies that no more than two nodes of T are simulated by any node of CT and that the simulation of any general algorithm for T on CT incurs a slowdown of at most 8. We refer the interested reader to [6] for more details. In [6], we also consider efficient embeddings of compressed trees into Boolean hypercubes.
Conclusions
In this paper, we have introduced generalized compressed trees. We showed that an $n$-leaf type 2 compressed tree can be laid out in a VLSI grid having area $2.22n + o(n)$. Furthermore, an $n$-leaf type 3 (resp. type 4) compressed tree has a 3-d layout of volume $2.22n + o(n)$ (resp. $2.072n + o(n)$) under the All-Plane-Active model. Comparing these with the most compact 2-d and 3-d layouts of an $n$-leaf complete binary tree, we find that compressed trees have smaller area and volume.
We also showed that divide-and-conquer algorithms that are designed for complete binary trees and in which computation is performed level by level can be simulated on compressed trees with the same time complexity; i.e., no slowdown is incurred by the simulation. For the general case, using graph embedding techniques, we have shown in [6] that any algorithm originally designed for a complete binary tree can be simulated on a compressed tree with a constant slowdown of at most 8.
It is not hard to see that the generalized compressed trees have ``small'' bisection bandwidth. Hence, even though they compare well with complete binary trees in terms of VLSI layouts and algorithms, bisection bandwidth would be a drawback if they are considered as an interconnection network for parallel machines.
Naturally, a number of questions remain to be investigated regarding compressed trees. In particular, efficient 3-d layouts of $CT_k(h)$, $k = 3, 4$, under the One-Plane-Active model were not considered in this paper. Efficient embeddings of compressed trees into other networks such as de Bruijn, butterfly, and shuffle-exchange networks also remain to be investigated.
