The modern integrated circuit is one of the most complex products that have been engineered to date. It continues to grow in complexity as the years progress. As a result, very large-scale integrated (VLSI) circuit design now involves massive design teams employing state-of-the-art computer-aided design (CAD) tools. One of the oldest, yet most important CAD problems for VLSI circuits is physical design automation, where one needs to compute the best physical layout of millions to billions of circuit components on a tiny silicon surface [21]. The process of mapping an electronic design to a chip involves a number of physical design stages, one of which is clustering. In this paper, we focus on problems in clustering which are critical for more sustainable chips. The clustering problem in combinatorial circuits alone is a source of multiple models. In particular, we consider the problem of clustering combinatorial circuits for delay minimization, when logic replication is not allowed (CN). The problem of delay minimization when logic replication is allowed (CA) has been well studied, and is known to be solvable in polynomial time [16]. However, unbounded logic replication can be quite expensive. Thus, CN is an important problem. We show that selected variants of CN are NP-hard. We also obtain approximability and inapproximability results for these problems. A preliminary version of this paper appeared in [31].
Introduction
In this paper, we consider the problem of clustering combinatorial circuits for delay minimization when logic replication is not allowed (CN). Combinatorial circuits implement Boolean functions, and produce a unique output for every combination of input signals [22] . The gates and their interconnections in the circuit represent implementations of one or more Boolean function(s). The Boolean functions are realized by the assignment of the gates to chips.
Due to manufacturing process and capacity constraints, it is generally not possible to place all of the circuit elements in one chip. Consequently, the circuit must be partitioned into clusters, where each cluster represents a chip in the overall circuit design. The circuit elements are assigned to clusters, while satisfying certain design constraints (e.g., area capacity) [16] .
Gates and their interconnections usually have delays. The delays of the interconnections are determined by the way the circuit is clustered. Intra-cluster delays are associated with the interconnections between gates in the same cluster. Inter-cluster delays are associated with the interconnections between gates in different clusters. The delay along a path from an input to an output is the sum of the delays of the gates and interconnections on that path. The delay of the overall circuit, with respect to its clustering, is the maximum delay among all paths that connect an input to an output in the circuit. The problem of clustering combinatorial circuits for delay minimization, when logic replication is allowed (CA), is well studied. It arises frequently in VLSI design. In CA, the goal is to find a clustering of a circuit for which the delay of the overall circuit is minimized. CA has been shown to be solvable in polynomial time [16]. However, unbounded replication can be quite expensive. As systems become increasingly complex, clustering without logic replication becomes crucial. It follows that CN is an important problem in VLSI design.
In this paper, we consider several variants of CN. We prove NP-hardness results for these variants. We design an approximation algorithm for one of them. We also obtain inapproximability results.
The rest of this paper is organized as follows: The problem is formally described in Section 2. We then examine related work in Section 3. In Section 4, we give some hardness results for the clustering problem. We also show that our hardness results imply inapproximability. In Section 5, we propose an approximation algorithm for solving the clustering problem, when the gates are unweighted and the cluster capacity, M, is 2. We conclude the paper with Section 6 by summarizing our main results, and identifying avenues for future work.
Statement of Problems
In this section, we formally describe the problem studied in this paper. We start with graph preliminaries. Next, we formulate the problem using the language of combinatorial circuits. Finally, we represent such circuits as directed acyclic graphs and formulate the main problem using graph-theoretic terminology.
Graph Preliminaries
In this subsection, we define the main graph-theoretic concepts that are used in the paper.
Graphs considered in this paper do not contain loops or parallel edges. The degree of a vertex v of a graph G is the number of edges of G incident with v.
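A path of G is a sequence of vertices of G in which every two consecutive vertices are joined by an edge. Similarly, a directed path of a directed graph D is a sequence $Q = v_0, v_1, \ldots, v_l$ of vertices of D such that $(v_{i-1}, v_i)$ is an arc of D for $i = 1, \ldots, l$.
The number l is called the length of the path Q, and sometimes we say that Q is a directed l-path of D. If $v_0 = v_l$, then Q is called a directed cycle. D is said to be a directed acyclic graph (DAG) if it contains no directed cycles.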
A cluster is defined as a subset of the vertices of a graph. If C is a cluster in a graph, then an edge is said to be a cut-edge if it connects a vertex of C to a vertex from V \C. The degree of C is the number of cut-edges incident with a vertex in C.
The fanin and fanout of a vertex are the number of arcs that enter and leave the vertex, respectively. A source represents a vertex with fanin equal to 0, and a sink represents a vertex with fanout equal to 0. As the example from Figure 1 shows, a DAG may have more than one source and more than one sink.
Let I and O be the set of sources and sinks of G, respectively. Notice that I = {a, b} and O = {e, f } in the DAG in Figure 1 ; C 1 = {a, c, g} and C 2 = {b, e, f } represent a pair of disjoint clusters.
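The definitions of fanin, fanout, sources, sinks, and cluster degree can be illustrated with a minimal Python sketch. The arc list below is a hypothetical DAG chosen only so that its sources, sinks, and the clusters C 1 and C 2 match the sets named above; it is not claimed to be the DAG of Figure 1, and the function names are illustrative.

```python
def fanin_fanout(vertices, arcs):
    """Count the arcs entering (fanin) and leaving (fanout) each vertex."""
    fanin = {v: 0 for v in vertices}
    fanout = {v: 0 for v in vertices}
    for u, v in arcs:
        fanout[u] += 1
        fanin[v] += 1
    return fanin, fanout


def sources_and_sinks(vertices, arcs):
    """Sources have fanin 0, sinks have fanout 0."""
    fanin, fanout = fanin_fanout(vertices, arcs)
    sources = {v for v in vertices if fanin[v] == 0}
    sinks = {v for v in vertices if fanout[v] == 0}
    return sources, sinks


def cluster_degree(cluster, arcs):
    """Number of cut-edges, i.e. arcs with exactly one endpoint in the cluster."""
    cluster = set(cluster)
    return sum(1 for u, v in arcs if (u in cluster) != (v in cluster))


# Hypothetical DAG (an assumption for illustration, not taken from Figure 1).
VERTICES = ["a", "b", "c", "g", "e", "f"]
ARCS = [("a", "c"), ("b", "c"), ("c", "g"), ("g", "e"), ("g", "f"), ("b", "f")]

if __name__ == "__main__":
    print(sources_and_sinks(VERTICES, ARCS))
    print(cluster_degree({"a", "c", "g"}, ARCS))
    print(cluster_degree({"b", "e", "f"}, ARCS))
```

On this assumed DAG, both clusters have three cut-edges, since every arc crossing between them is counted once for each cluster.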
Formulation of the problem using the language of combinatorial circuits
In general, each gate in a circuit has an associated delay [15] . In the model that we consider in this paper, each interconnection has one of the following types of delays: (1) an intra-cluster delay, d, when there is an interconnection between two gates in the same cluster, or (2) an inter-cluster delay, D, when there is an interconnection between two gates in different clusters.
Note that D >> d, so inter-cluster delays typically dominate in all delay calculations.
The delay along a path from an input to an output is the sum of the delays of the gates and interconnections that lie on the path. The delay of the overall circuit is the maximum delay among all source to sink paths in the circuit.
Technology and design paradigms impose a number of constraints on the clustering of a circuit. So, a clustering is feasible if all clusters obey the imposed constraints. Constraints can be either monotone or non-monotone. In [11] , the following definition is given: Definition 1 A constraint is said to be monotone if and only if any connected subset of gates in a feasible cluster is also feasible.
A typical constraint includes capacity (a monotone constraint), which is a fixed constant M, denoting an upper-bound on the number of gates allowed in a cluster.
In CN, a clustering partitions the circuits into disjoint subsets. A clustering algorithm tries to achieve one or both of the following goals, subject to one or more constraints:
(1) The delay minimization through the circuit [16] . (2) The minimization of the total number of cut-edges [8] .
In this paper, we study CN under the delay model described as follows:
1. Associated with every gate v of the circuit, there is a delay δ(v) and a size w(v).
2. The delay of an interconnection between two gates within a single cluster is d.
3. The delay of an interconnection between two gates in different clusters is D, where D >> d.
The size of a cluster is the sum of the sizes of the gates in the cluster. The precise formulation of the problem is as follows:
CN: Given a combinatorial circuit, with each gate having a size and a delay, intra- and inter-cluster delays d and D, respectively, and a positive integer M called the cluster capacity, the goal is to partition the circuit into clusters such that
1. The size of each cluster is bounded by M,
2. The delay of the circuit is minimized.
A combinatorial circuit can be represented as a directed graph G = (V, E), with vertex-set V and edge-set E, such that G has no directed cycles. In G, each vertex v ∈ V represents a gate, and each edge (u, v) ∈ E represents an interconnection between gates u and v.
Given a clustering of the combinatorial circuit, the delays on the interconnections between gates induce an edge-length function l : E(G) → {d, D} of G. The weight of a cluster is the sum of the weights of the vertices in the cluster.
Formulation of the problem using graph-theoretic terms
In the rest of the paper, we focus on a graph-theoretic formulation of CN. We employ the following notations and concepts: The length of a path P in G is calculated as the sum of all delays of vertices and edge-lengths of edges of P. X below can be either W , which means that the vertices are weighted, or N, which means that the vertices are unweighted. M is the cluster capacity. ∆ is the maximum number of arcs entering or leaving any vertex of the DAG.
CN is formulated (graph-theoretically) as follows:
CN X, M, ∆ : Given a DAG G = (V, E), with vertex-weight function w : V → N, delay function δ : V → N, constants d and D, and a cluster capacity M, the goal is to partition V into clusters such that
1. The weight of each cluster is bounded by M,
2. The maximum length of any path from a source to a sink of G is minimized.
A clustering of G, such that the weight of each cluster is bounded by M, is called feasible. Given a feasible clustering of G, one can consider the corresponding edge-length function l : E(G) → {d, D} of G. A maximum length path (with respect to l) from a source to a sink of G is called an optimal path. A feasible clustering of G is optimal if the length of its optimal path is the smallest among all feasible clusterings. An optimal path with respect to an optimal clustering is called a critical path.
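The notions of feasibility and optimal-path length can be made concrete with a short Python sketch. It checks the capacity constraint and computes the length of a longest source-to-sink path under the edge-length function induced by a clustering. The function names and the four-vertex chain used as an example are assumptions made for illustration; the chain is not the circuit of Figure 2.

```python
from functools import lru_cache


def is_feasible(clusters, weight, M):
    """A clustering is feasible if every cluster has total weight at most M."""
    return all(sum(weight[v] for v in c) <= M for c in clusters)


def circuit_delay(arcs, delta, clusters, d, D):
    """Length of a longest source-to-sink path: vertex delays plus edge lengths,
    where an arc has length d inside a cluster and D across clusters."""
    cluster_of = {v: i for i, c in enumerate(clusters) for v in c}
    succ = {v: [] for v in cluster_of}
    for u, v in arcs:
        succ[u].append(v)

    @lru_cache(maxsize=None)
    def longest_from(u):
        # Memoized recursion over the DAG; a sink contributes only its own delay.
        best = 0
        for v in succ[u]:
            edge = d if cluster_of[u] == cluster_of[v] else D
            best = max(best, edge + longest_from(v))
        return delta[u] + best

    has_pred = {v for _, v in arcs}
    sources = [v for v in cluster_of if v not in has_pred]
    return max(longest_from(s) for s in sources)


# Hypothetical chain s -> a -> b -> t with unit delays and weights (an assumption).
CHAIN = [("s", "a"), ("a", "b"), ("b", "t")]
UNIT = {v: 1 for v in "sabt"}

if __name__ == "__main__":
    pairs = [{"s", "a"}, {"b", "t"}]
    print(is_feasible(pairs, UNIT, M=2), circuit_delay(CHAIN, UNIT, pairs, d=1, D=2))
```

Because each vertex is evaluated once and each arc is inspected once, the computation runs in time linear in the size of the DAG, which is the fact used later to place the decision version in NP.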
Fig. 3 Cases of the delay minimization problem that we plan to investigate.
In Figure 2 , we consider a simple example of a clustering of a combinatorial circuit represented by a DAG, where logic replication is not allowed. In this example, the weights and delays of all vertices are equal to 1 (i.e., δ (v) = 1 and w(v) = 1 for all vertices v in the DAG); the upper bound for the weight of the cluster is M = 2; the intracluster delay is d = 1; and, the inter-cluster delay is D = 2. It can be easily seen that the partition Σ = {{s, a}, {b, e}, {c,t}} forms a feasible clustering such that the length of the optimal path is 9. Moreover, it can be checked that this clustering is optimal.
We investigate the delay minimization problem for the cases shown in Figure 3 . In particular, our goal is to obtain reductions among these problems.
In this paper, we focus on a restriction of CN X, M, ∆ , when δ (v) = 0 for any vertex v of G.
The main contributions of this paper are as follows:
1. Establishing the NP-hardness of CN W, M, ∆ and several of its variants (Section 4).
2. Design and analysis of a 3-approximation algorithm for CN N, 2, ∆ (Section 5).
Related Work
In this section, we describe some related work in the literature.
In [11] , the authors present an exact polynomial-time algorithm for CA. The problem is solved under the so-called unit delay model [11] .
A more general delay model is presented in [15] . The problem of disjoint clustering for minimum delay under the area or pin constraint is shown to be intractable in [15] . To minimize the delay, the authors propose an algorithm which constructs a clustering. This algorithm achieves the optimal delay under specific conditions.
In [16] , CA is considered under the more general delay model proposed in [15] . However, [16] presents a different polynomial-time algorithm. Their heuristic is shown to always find an optimal clustering under any monotone clustering constraint.
Similar to [15] , the problem of disjoint clustering for minimum delay under the area or pin constraint is also shown in [9] to be intractable. However, an improved heuristic is proposed in [9] . The authors also share comparative experimental results which show that a decrease in clusters generally leads to an increase in maximum delay.
In [19], the authors propose an efficient network-flow based algorithm which determines an optimal partitioning of the circuit. Using the least amount of replication, the optimal partitioning separates the nodes of the circuit into two subsets with the smallest cut size. The algorithm presented in [19] is also applicable to size-constrained partitioning.
[25] and [26] explore the advantage of evolutionary algorithms aimed at reducing the delay and area in partitioning and floorplanning. In turn, this would reduce the wirelength. A hybrid of the evolutionary algorithms is used to find optimal solutions to VLSI physical design problems.
In [27] , the authors present an algorithm for simultaneous multilayer interconnect spacing. While satisfying maximum delay constraints, their unique algorithm guarantees to minimize the total dynamic power dissipation caused by an interconnect.
In [30] , adjustable delay buffers (ADBs) are used to minimize clock skew under different power modes. The ADBs have delays which can be tuned or adjusted. When the positions of some fixed number of ADBs are assumed to be predetermined, the authors propose a linear-time optimal algorithm. This algorithm assigns the values of the ADBs so as to minimize clock skew among all possible ADB assignments. In this case, there is a possibility of latency penalty. They also propose a modified algorithm to find an optimal solution with no latency penalty. Additionally, they give an efficient heuristic for finding good ADB positions.
Similar to [30], the author of [23] studies the use of ADBs to minimize clock skew under different power modes. In order to generate zero clock skew in a given clock tree, they start by assigning ADB positions. If the number of ADBs assigned does not meet the constraints of the previous solution, they use a bottom-up approach for removing ADBs to minimize clock skew while satisfying all constraints.
[24] examines the methods used to solve bi-criterion VLSI circuit partitioning problems. The authors present a hybrid genetic algorithm (GA) which employs the Taguchi method for local search. They test their hybrid algorithm on a variety of benchmark circuits, and find it superior to the standard GA and tabu search algorithms reported in the literature.
A routability-driven clustering technique for area and power reduction in clustered FPGAs is presented in [18] . This technique uses a cell connectivity metric to identify seeds for efficient clustering. Effective seed selection, coupled with an interconnect-resource aware clustering and placement, can have a remarkable impact on circuit routability. It leads to better device utilization, reduction in power consumption and savings in area [18] . Additionally, routing area is reduced by 35%. The authors also show that their clustering technique can reduce the overall device power usage by an average of 13%.
In [14], effective circuit partitioning techniques are employed by using clustering algorithms. The technique presented in [14] uses the circuit netlist in order to cluster the circuit in partitioning steps. It also minimizes the interconnection distance with the required iteration level. Well-known clustering algorithms such as K-Mean, Y-Mean, and K-Medoid are applied to the standard benchmark circuits. The results obtained in [14] show that the proposed techniques improve the delay. They also minimize the area by reducing the interconnection distance.
The multiway partition problem remains NP-hard, even when the input hypergraph is an unweighted graph, and there is no restriction on the sizes of clusters. If the number of clusters is fixed (say r), then there is an algorithm that runs in time $O(n^{r^2})$ that solves this restriction exactly [7]. Here n is the number of vertices of G. If some prescribed vertices v i of G are given and the goal is to find a solution to the multiway partition problem so that the cluster V i contains the vertex v i , the problem becomes harder. It is proved to be NP-hard when r = 3 and the maximum degree in G is at most 4. On the positive side, this restriction can be solved in polynomial time when the input graph is planar. However, if r is arbitrary then the problem is NP-hard even when G is planar [5].
The case of the multiway partition problem in which r = 2 is frequently encountered in the literature. This case is called the bipartition problem. It is NP-hard for d-regular graphs [2], where d ≥ 3 is a fixed constant. On the positive side, there is a dynamic-programming-based algorithm for solving this problem in the class of trees [1, 6, 13]. Note that the computational complexity of the bipartition problem is unknown if G is a planar graph.
Computational Complexity of CN
In this section, we obtain the main results that deal with the computational complexity of CN. We prove theorems that establish the NP-hardness of some variants of CN. Our reductions imply that CN is inapproximable within a certain factor.
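In order to formulate the results, we consider CNWD, which is formulated as follows:
CNWD: Given an instance of CN W, M, ∆ and a number k, the goal is to check whether there is a feasible clustering in which the length of an optimal path is at most k.
It is not hard to see that CNWD is the decision version of CN W, M, ∆ . We make this correspondence explicit by writing CNWD as CNWD W, M, ∆ . We use the same notation for restrictions of CN W, M, ∆ .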
Note that CNWD W, M, ∆ is in NP. This follows from the well-known fact that a maximum weighted path in an edge-weighted DAG can be found in polynomial time.
If A is a subset of positive integers, then we denote by CNWD A, M, ∆ the restriction of CNWD W, M, ∆ in which the weights of vertices of the input DAG are from A.
Our first theorem establishes the NP-completeness of CNWD W, M, ∆ . Clearly, this means that CN W, M, ∆ is NP-hard.
Theorem 1 CNWD W, M, ∆ is NP-complete.
Proof It is clear that CNWD W, M, ∆ is in NP. This follows from the well-known fact that a maximum weighted path in an edge-weighted DAG can be found in polynomial time.
In order to establish NP-hardness of CNWD W, M, ∆ , we present a reduction from the PARTITION problem.
Recall the PARTITION problem:
PARTITION: Given a set S = {a 1 , a 2 , . . . , a n } of positive integers, the goal is to check whether there is a set $S_1 \subseteq S$ such that $\sum_{a_i \in S_1} a_i = \sum_{a_i \in S \setminus S_1} a_i$.
Without loss of generality, we assume that $B = \sum_{a_i \in S} a_i$ is even, otherwise the problem is trivial.
We now construct an instance I ′ of CNWD W, M, ∆ as shown in Figure 4 . There is a source s connected to a sink t through n vertices labeled v 1 through v n . None of the v i vertices are connected to each other. Each of the v i vertices has a weight a i , and s and t have weights equal to B/2. We set D = 1 and d = 0. All vertices are given a delay of 0. The cluster capacity is set to B, and we take k = 1. The description of I ′ is complete.
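The construction of I ′ can be sketched in Python and checked by exhaustive search on tiny inputs. The sketch below builds the vertex weights and capacity from a list of positive integers with even sum, then enumerates all feasible clusterings to find the smallest achievable delay. The function names are illustrative assumptions, and the brute-force search is exponential, so it is meant only as a sanity check and is not part of the reduction.

```python
def build_instance(a):
    """Build weights and capacity for I': s and t weigh B/2, v_i weighs a_i, M = B."""
    B = sum(a)
    assert B % 2 == 0, "the reduction assumes that B is even"
    weight = {"s": B // 2, "t": B // 2}
    for i, ai in enumerate(a, start=1):
        weight[f"v{i}"] = ai
    return weight, B


def set_partitions(items):
    """Yield every partition of a list into non-empty blocks (exponential)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [part[i] | {first}] + part[i + 1:]
        yield part + [{first}]


def min_delay(a):
    """Smallest delay over all feasible clusterings, with D = 1, d = 0, zero vertex delays."""
    weight, M = build_instance(a)
    middle = [v for v in weight if v not in ("s", "t")]
    best = None
    for clusters in set_partitions(list(weight)):
        if any(sum(weight[v] for v in c) > M for c in clusters):
            continue  # violates the capacity constraint
        cluster_of = {v: i for i, c in enumerate(clusters) for v in c}
        # Each path s -> v_i -> t pays 1 for every arc that crosses clusters.
        length = max(
            (cluster_of["s"] != cluster_of[v]) + (cluster_of[v] != cluster_of["t"])
            for v in middle
        )
        best = length if best is None else min(best, length)
    return best


if __name__ == "__main__":
    print(min_delay([1, 2, 3]))  # {1, 2} and {3} balance
    print(min_delay([1, 1, 4]))  # no balanced split exists
```

For the list [1, 2, 3] a balanced split exists and the search finds delay 1, whereas for [1, 1, 4] the element 4 fits with neither s nor t and the best delay is 2, matching the threshold k = 1.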
We observe that I ′ can be constructed from an instance I of the PARTITION problem in polynomial time. In order to complete the proof of the theorem, we show that I is a "yes" instance of the PARTITION problem if and only if I ′ is a "yes" instance of CNWD W, M, ∆ .
Assume that I is a "yes" instance of the PARTITION problem. This means that there exists a partition of S into S 1 and S − S 1 such that $\sum_{a_i \in S_1} a_i = \sum_{a_i \in S \setminus S_1} a_i = B/2$. Group the vertices corresponding to the elements in S 1 with s, and the remaining vertices with t in G. Observe that the packing constraint is met. Moreover, the length of the optimal path from s to t is 1. This means that I ′ is a "yes" instance of CNWD W, M, ∆ .
For the proof of the converse statement, assume that I ′ is a "yes" instance of CNWD W, M, ∆ . This means that there is a way of packing the vertices of G into clusters such that the length of the optimal path from s to t is 1. We observe that every vertex must be packed with either s or t, otherwise the path going through that vertex has length 2. Let w(s) and w(t) be the weights of vertices s and t, respectively, and let w(s i ) and w(t i ) be the sum of the weights of vertices packed with s and t, respectively. Clearly,
w(s) + w(s i ) + w(t) + w(t i ) = 2 · B.
Since w(s) + w(s i ) ≤ B and w(t) + w(t i ) ≤ B, both inequalities hold with equality. As w(s) = w(t) = B/2, it follows that w(s i ) = w(t i ) = B/2. Thus, we have obtained the desired partition of S. Hence, I is a "yes" instance of PARTITION.
The next theorem serves to strengthen Theorem 1.
Theorem 2 CNWD W, M, 3 is NP-complete.
Proof It is clear that CNWD W, M, 3 is in NP.
This follows from the well-known fact that a maximum weighted path in an edge-weighted DAG can be found in polynomial time.
In order to establish NP-hardness of CNWD W, M, 3 , we present a reduction from PARTITION.
We construct a new instance I ′ of CNWD W, M, 3 as shown in Figure 5 . Each vertex v i (i ∈ {1, . . . , n}) belongs to a path which connects the source s to the sink t. Let V denote the set of all v i vertices. Let S denote the set of all vertices that are predecessors to the vertices in V , and let T denote the set of all vertices that are successors to the vertices in V . Since |S| = |T|, let m denote the size of S and T . No pair of vertices in V are connected. Each vertex v i ∈ V has a weight of a i . Every vertex in S and T has weight 1. So, the sum of the weights of all vertices in S is equal to m, and the sum of the weights of all vertices in T is equal to m. We set D = 1 and d = 0. Every vertex is given a delay of 0. The cluster capacity M is set to B/2 + m, and we take k = 1. The description of I ′ is complete.
Observe that I ′ can be constructed from an instance I of the PARTITION problem in polynomial time. In order to complete the proof of the theorem, we show that I is a "yes" instance of PARTITION, if and only if I ′ is a "yes" instance of CNWD W, M, 3 .
Assume that I is a "yes" instance of PARTITION. This means that there exists a partition of A into A 1 and A 2 , such that $\sum_{x \in A_1} x = \sum_{x \in A_2} x = B/2$. Group the vertices corresponding to the elements in A 1 with S, and the remaining vertices with T . Observe that the cluster capacity constraint is met. Moreover, the length of the optimal path from a source to a sink is 1. This means that I ′ is a "yes" instance of CNWD W, M, 3 .
Conversely, assume that I ′ is a "yes" instance of CNWD W, M, 3 . This means that there is a way of packing the vertices of the DAG in Figure 5 into clusters, such that the cluster capacity is not exceeded, and the length of the optimal path from s to t is 1.
Observe that every vertex belonging to S must be clustered together, and every vertex belonging to T must be clustered together. Otherwise, the length of the path from s to t is greater than 1. Additionally, if there is a vertex v i ∈ V which is not packed with either S or T , then the length of the path from s to t is greater than 1. Therefore, V cannot be partitioned into more than two sets.
Let V S and V T denote the subset of vertices v i ∈ V that are packed with S and T , respectively. Observe that V S ∪V T = V . Moreover, the length of the path from s to any vertex in V S must be 0, and the length of the path from any vertex in V T to t must also be 0.
Let w(S) denote the sum of the weights of all vertices in S, and w(T ) denote the sum of the weights of all vertices in T . Notice that w(S) = w(T ) = m. Let w(V S ) and w(V T ) denote the sum of the weights of all vertices in V S and V T , respectively.
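Notice that $w(S) + w(V_S) + w(T) + w(V_T) = B + 2m$.
Since $w(S) + w(V_S) \le B/2 + m$ and $w(T) + w(V_T) \le B/2 + m$, both inequalities hold with equality.
This implies that $w(V_S) = w(V_T) = B/2$.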
Thus, we have obtained the desired partition of A. Hence, I is a "yes" instance of PARTITION.
The proof of the theorem follows.
The proof of Theorem 2 implies an inapproximability result for CN W, M, 3 .
Corollary 1 CN W, M, 3 does not admit a (2 − ε)-approximation algorithm for any ε > 0, unless P = NP.
Proof Consider the reduction from PARTITION described in the proof of Theorem 2.
Observe that if the PARTITION instance is a "no" instance, then in any feasible clustering there must exist at least one vertex of V which is packed with neither S nor T . This means that there are at least 2 D-edges along some source-to-sink path, whereas a "yes" instance admits a clustering in which the optimal path has length 1. Hence, any (2 − ε)-approximation algorithm can be used to solve the PARTITION problem exactly. The proof of the corollary follows.
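The gap argument behind this proof can be written out explicitly. In the sketch below, OPT(I ′) and ALG(I ′) are symbols introduced here for illustration: they denote the optimal delay of the constructed instance and the delay returned by a hypothetical (2 − ε)-approximation algorithm. Since D = 1, d = 0, and all vertex delays are 0, every delay is a non-negative integer.

```latex
% Yes-instance of PARTITION: the optimum is 1, so the approximation must return 1.
% No-instance of PARTITION: every feasible clustering has delay at least 2.
\begin{align*}
\mathrm{OPT}(I') = 1 \;&\Longrightarrow\; 1 \le \mathrm{ALG}(I') \le (2-\varepsilon)\cdot 1 < 2
  \;\Longrightarrow\; \mathrm{ALG}(I') = 1,\\
\mathrm{OPT}(I') \ge 2 \;&\Longrightarrow\; \mathrm{ALG}(I') \ge \mathrm{OPT}(I') \ge 2 .
\end{align*}
```

Comparing the returned delay with 1 therefore distinguishes the two cases in polynomial time.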
In the proof of the following theorem, we use a 3SAT reduction modeled after the one presented in [9] .
Theorem 3 CNWD {1, 2, 3}, 3, 3 is NP-complete.
Proof It is clear that CNWD {1, 2, 3}, 3, 3 is in NP. This follows from the well-known fact that a maximum weighted path in an edge-weighted DAG can be found in polynomial time.
In order to establish NP-hardness of CNWD {1, 2, 3}, 3, 3 , we present a reduction from 3SAT.
For that purpose, we recall 3SAT as follows: 3SAT: Given a 3-CNF formula φ with n variables x 1 , . . . , x n and m clauses C 1 , . . . ,C m , the goal is to check whether φ has a satisfying assignment.
Without loss of generality, for all i ∈ {1, . . . , n} we assume that each variable x i in φ appears at most 3 times and each literal at most twice. (Any 3SAT instance can be transformed to satisfy these properties in polynomial time [28].) Let each variable x i (1 ≤ i ≤ n) be represented by a variable gadget as shown in Figure 6 (a). Let each clause C j (1 ≤ j ≤ m) be represented by a clause gadget as shown in Figure 6 (b). If a variable x i or its complement x̄ i is the 1st, 2nd, or 3rd literal of a clause C j , then the corresponding vertex labeled x i (or x̄ i ) is connected to a sink labeled C j through a pair of vertices labeled y j1 and z j1 , y j2 and z j2 , or y j3 and z j3 , respectively.
We now construct an instance I ′ of CNWD {1, 2, 3}, 3, 3 as shown in Figure 7 . The resulting DAG G represents a combinatorial circuit. Let V denote the set of all vertices labeled x i or x̄ i (1 ≤ i ≤ n). There are n sources T i (1 ≤ i ≤ n) connected to m sinks C j (1 ≤ j ≤ m) through some vertices in V and 3m pairs of vertices labeled y j p and z j p (1 ≤ j ≤ m, 1 ≤ p ≤ 3). Each y j p is connected to exactly one variable gadget, and for fixed j, no two vertices in {y j1 , y j2 , y j3 } are adjacent to the vertices x i and x̄ i belonging to the same variable gadget. In other words, x i and x̄ i cannot both be connected to the same clause gadget. Every T i , z j p , and C j has a weight of 1, every x i , x̄ i ∈ V has a weight of 2, and every y j p has a weight of 3. We set D = 1 and d = 0. All vertices are given a delay of 0. The cluster capacity M is set to 3, and we take k = 3. The description of I ′ is complete.
Observe that I ′ can be constructed from I in polynomial time. In order to complete the proof of the theorem, we show that I is a "yes" instance of 3SAT, if and only if I ′ is a "yes" instance of CNWD {1, 2, 3}, 3, 3 .
Suppose that I is a "yes" instance of 3SAT. This means that there exists an assignment for φ such that every clause has at least one true literal. If a literal is set to true, then the corresponding vertex x i (or x̄ i ) should be clustered with T i , but if it is set to false, then the corresponding vertex is clustered alone. Notice that every y j p must be clustered alone. Since each clause C j has at least one true literal, the vertex z j p corresponding to one such literal should be clustered alone. This means that every source-to-sink path going through the vertices y j p and z j p corresponding to that true literal has length 3. If either of the other two z j p vertices belonging to the respective clause gadget corresponds to a literal which is set to false, it should be clustered with C j . Otherwise, it may also be clustered alone. Note that clustering two z j p vertices with C j , even if they both correspond to true literals, leads to paths of length 2 < 3 = k. Observe that the cluster capacity constraint is met, and the length of the optimal path from any source T i to any sink C j is 3. This means that I ′ is a "yes" instance of CNWD {1, 2, 3}, 3, 3 .
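The path lengths claimed for this clustering can be tallied edge by edge. The sketch below assumes that every source-to-sink path has the four-edge form T i → x i → y j p → z j p → C j (the same holds with x̄ i in place of x i ), which is consistent with the lengths quoted in the proof; the symbol ℓ for the path length is introduced here only for illustration.

```latex
% Edge lengths are listed in path order:
% (T_i, x_i), (x_i, y_{jp}), (y_{jp}, z_{jp}), (z_{jp}, C_j).
% An arc has length d = 0 inside a cluster and D = 1 across clusters; vertex delays are 0.
\begin{align*}
\text{true literal, } z_{jp} \text{ alone:} \quad & \ell = 0 + 1 + 1 + 1 = 3,\\
\text{false literal, } z_{jp} \text{ with } C_j\text{:} \quad & \ell = 1 + 1 + 1 + 0 = 3,\\
\text{true literal, } z_{jp} \text{ with } C_j\text{:} \quad & \ell = 0 + 1 + 1 + 0 = 2.
\end{align*}
```

In every case the length is at most k = 3, since y j p has weight 3 and is always a single-vertex cluster, so the two arcs incident with it always cost 1 each.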
Conversely, suppose that I ′ is a "yes" instance of CNWD {1, 2, 3}, 3, 3 . This means that there is a way of packing the vertices of G into clusters of capacity M = 3, such that the length of the optimal path from source to sink is 3.
Since M = 3, again notice that every y j p must be clustered alone. Each vertex C j may be clustered with at most 2 of the z j p vertices. So, at least one z j p is clustered alone. However, notice that any source-to-sink path with a vertex z j p clustered alone has length at least 3. In order to satisfy the optimal path constraint, each T i must be clustered with the vertex x i (or x̄ i ) which corresponds to a vertex z j p clustered alone. Otherwise, the length of the path would be 4 > k = 3. To avoid exceeding the cluster capacity, either x i or x̄ i (but not both) may be clustered with T i . Finally, notice that z j p vertices along paths whose corresponding literals are considered false must be clustered with their respective sinks C j . Otherwise, such a path, with all internal vertices belonging to single-vertex clusters, would have length 4 > 3 = k. Take the literal which corresponds to the vertex clustered with T i , and set its value to true. Take the literal which corresponds to the vertex not clustered with T i , and set its value to false.
Setting to true all literals whose corresponding vertices x i (or x̄ i ) are clustered with T i , and setting to false all literals whose corresponding vertices are not clustered with T i , ensures that at least one true literal appears in every clause. Thus, a satisfying clustering for G yields a satisfying assignment for φ . Hence, I is a "yes" instance of 3SAT.
The proof of Theorem 3 implies an inapproximability result for CN {1, 2, 3}, 3, 3 .
Corollary 2 CN {1, 2, 3}, 3, 3 does not admit a (4/3 − ε)-approximation algorithm for any ε > 0, unless P = NP.
Proof Consider the reduction from 3SAT described in the proof of Theorem 3. Observe that in any approximate solution of the clustering problem, there are at least 3 and at most 4 D-edges along a longest source-to-sink path. Hence, any (4/3 − ε)-approximation algorithm can be used to solve the 3SAT problem exactly.
The proof of the corollary follows.
The next theorem concerns a restriction of CNWD W, 2, ∆ .
Theorem 4 CNWD {1, 2}, 2, 4 is NP-complete.
Proof It is clear that CNWD {1, 2}, 2, 4 is in NP. This follows from the well-known fact that a maximum weighted path in an edge-weighted DAG can be found in polynomial time.
In order to establish NP-hardness of CNWD {1, 2}, 2, 4 , we present a reduction from 3-BOUNDED POSITIVE 1-IN-3 SAT (3-BP 1-IN-3 SAT).
For that purpose, we recall 3-BP 1-IN-3 SAT as follows: 3-BP 1-IN-3 SAT: We are given a 3-CNF formula φ with n positive variables x 1 , . . . , x n and m clauses C 1 , . . . ,C m , such that each variable appears in at most 3 clauses. The goal is to check whether φ has a satisfying assignment such that every clause of φ has exactly one true literal [29] .
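The two conditions in the definition of 3-BP 1-IN-3 SAT can be checked mechanically, as the following minimal Python sketch shows. Clauses are encoded as tuples of variable indices and an assignment as a dictionary from index to Boolean; this encoding and the function names are assumptions made for illustration.

```python
from collections import Counter


def is_three_bounded(clauses):
    """Every variable appears in at most 3 clauses."""
    counts = Counter(x for clause in clauses for x in set(clause))
    return all(c <= 3 for c in counts.values())


def is_one_in_three_satisfying(clauses, assignment):
    """Every clause of positive literals has exactly one true variable."""
    return all(sum(bool(assignment[x]) for x in clause) == 1 for clause in clauses)


# Hypothetical formula: (x1 v x2 v x3) and (x1 v x4 v x5).
CLAUSES = [(1, 2, 3), (1, 4, 5)]

if __name__ == "__main__":
    only_x1 = {1: True, 2: False, 3: False, 4: False, 5: False}
    x1_and_x2 = {1: True, 2: True, 3: False, 4: False, 5: False}
    print(is_three_bounded(CLAUSES))
    print(is_one_in_three_satisfying(CLAUSES, only_x1))
    print(is_one_in_three_satisfying(CLAUSES, x1_and_x2))
```

Setting only x 1 to true satisfies both clauses exactly once, whereas additionally setting x 2 to true makes the first clause contain two true literals and is rejected.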
Let each variable x i (1 ≤ i ≤ n) be represented by a variable gadget as shown in Figure 8 (a). Let each clause C j (1 ≤ j ≤ m) be represented by a clause gadget as shown in Figure 8 (b). If a variable x i is the 1st, 2nd, or 3rd literal of a clause C j , then the corresponding vertex labeled x i is connected to a sink labeled C j through a pair of vertices labeled y j1 and z j1 , y j2 and z j2 , or y j3 and z j3 , respectively.
We now construct an instance I ′ of CNWD {1, 2}, 2, 4 as shown in Figure 9 . The resulting DAG G represents a combinatorial circuit. Let V denote the set of all vertices labeled x i or x̄ i (1 ≤ i ≤ n). There are n sources F i (1 ≤ i ≤ n) connected to m sinks C j (1 ≤ j ≤ m) through some vertices in V and 3 · m pairs of vertices labeled y j p and z j p (1 ≤ j ≤ m, 1 ≤ p ≤ 3). Each y j p is connected to exactly one variable gadget. Every x i , x̄ i ∈ V , every F i , z j p , and C j has a weight of 1. Every y j p has a weight of 2. We set D = 1 and d = 0. All vertices are given a delay of 0. The cluster capacity M is set to 2, and we take k = 3. The description of I ′ is complete.
Observe that I ′ can be constructed from I in polynomial time. In order to complete the proof of the theorem, we show that I is a "yes" instance of 3-BP 1-IN-3 SAT, if and only if I ′ is a "yes" instance of CNWD {1, 2}, 2, 4 .
Suppose that I is a "yes" instance of 3-BP 1-IN-3 SAT. This means that there exists an assignment of φ such that every clause has exactly one true literal. If a literal is set to true, then the corresponding vertex x i should be clustered alone, but if it is set to false, then the corresponding vertex is clustered with F i . Since M = 2, every y j p must be clustered alone. Since each clause C j has exactly one true literal, the vertex z j p corresponding to that literal should be clustered with C j . The other two z j p vertices belonging to the respective clause gadget should be clustered alone. Observe that the cluster capacity constraint is met, and the length of the optimal path from any source F i to any sink C j is 3. This means that I ′ is a "yes" instance of CNWD {1, 2}, 2, 4 .
Conversely, suppose that I′ is a "yes" instance of CNWD {1, 2}, 2, 4. This means that there is a way of packing the vertices of G into clusters of capacity M = 2 such that the length of the optimal path from source to sink is 3.
Since M = 2, again notice that every $y_{jp}$ must be clustered alone. Each vertex $C_j$ may be clustered with at most one of the $z_{jp}$ vertices, so at least two of the $z_{jp}$ vertices of every clause gadget are clustered alone. However, notice that any source-to-sink path with a vertex $z_{jp}$ clustered alone has length at least 3. In order to satisfy the optimal path constraint, each $F_i$ must be clustered with the vertex $x_i$ corresponding to a vertex $z_{jp}$ which is clustered alone; otherwise, the length of the path would be 4 > 3 = k. So as not to exceed the cluster capacity, $F_i$ may be clustered with either $x_i$ or $\bar{x}_i$ (but not both). Finally, notice that all $z_{jp}$ vertices along paths with an isolated $x_i$ must be clustered with $C_j$; otherwise, such a path would have length 4 > 3 = k.
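By setting each literal whose corresponding vertex $x_i$ appears in the same cluster as $F_i$ to false, and each literal whose corresponding vertex $x_i$ is not clustered with $F_i$ to true, we obtain at least one true literal in every clause. Moreover, at least two $z_{jp}$ vertices of every clause gadget are clustered alone, and the literals corresponding to them are clustered with their sources and hence set to false, so every clause has exactly one true literal. Thus, a satisfying clustering for G yields a satisfying assignment for φ. Hence, I is a "yes" instance of 3-BP 1-IN-3 SAT.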
The proof of Theorem 4 implies an inapproximability result for CN {1, 2}, 2, 4.
Proof. Consider the reduction from 3-BP 1-IN-3 SAT described in the proof of Theorem 4. Observe that in any approximate solution of the clustering problem, there are at least 3 and at most 4 D-edges along any source-to-sink path. Hence, the approximation algorithm can be used to solve the 3-BP 1-IN-3 SAT problem exactly.
A 3-Approximation Algorithm for CN N, 2, ∆
In this section, we present a 3-approximation algorithm for CN N, 2, ∆ . Our algorithm makes use of the fact that there is a polynomial-time algorithm for finding a path with a maximum number of edges in DAGs. In each iteration, the algorithm picks a path P with a maximum number of edges. Then it considers the central edge e = (u, v) of P, and puts u and v in the same cluster. After that, u and v are removed from G. The algorithm iterates until all edges of the input DAG are exhausted.
Proof. For a path P, let l(P) be the length of P (i.e., the number of edges of P). Moreover, let
$$l = \max_{P} l(P).$$
So, l denotes the length of a longest path of G.
The following shows a lower bound for OPT , where OPT is the delay of the optimal clustering of G when M = 2.
Since P is an arbitrary path, the above inequality also holds for a longest path. Thus,
Now, let us estimate ALG, where ALG is the delay of the clustering found by the algorithm. We will consider 3 cases.
Case 1: l = 1. Then it can be easily seen that ALG = OPT.
Case 2: l is even. Then
Case 3: l is odd and l ≥ 3. Then
The proof of the theorem follows. Figure 10 shows an example of a DAG for which the algorithm achieves an approximation factor of 3. 
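The algorithm described at the start of this section can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the DAG is given as a vertex list and an edge list, the "central edge" of a path with l edges is taken to be the edge with index ⌊(l − 1)/2⌋ (a tie-breaking choice for even l that the text does not specify), and vertices left over once no edges remain are placed in singleton clusters. The function names are hypothetical.

```python
def longest_path(vertices, edges):
    """Return a path (list of vertices) with a maximum number of edges in a DAG."""
    succ = {v: [] for v in vertices}
    indeg = {v: 0 for v in vertices}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    # Kahn's algorithm; the list grows while it is being iterated over.
    order = [v for v in vertices if indeg[v] == 0]
    for u in order:
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                order.append(v)
    # Longest-path dynamic program over the topological order.
    length = {v: 0 for v in vertices}
    pred = {v: None for v in vertices}
    for u in order:
        for v in succ[u]:
            if length[u] + 1 > length[v]:
                length[v] = length[u] + 1
                pred[v] = u
    end = max(order, key=lambda v: length[v])
    path = [end]
    while pred[path[-1]] is not None:
        path.append(pred[path[-1]])
    return path[::-1]

def cluster_central_edges(vertices, edges):
    """Repeatedly cluster the endpoints of the central edge of a longest path."""
    vertices, edges = list(vertices), list(edges)
    clusters = []
    while edges:
        path = longest_path(vertices, edges)
        mid = (len(path) - 2) // 2          # assumed choice of central edge
        u, v = path[mid], path[mid + 1]
        clusters.append({u, v})
        # Remove u and v, together with all edges incident to them.
        vertices = [w for w in vertices if w not in (u, v)]
        edges = [(a, b) for a, b in edges if a not in (u, v) and b not in (u, v)]
    clusters.extend({w} for w in vertices)   # leftover vertices stay alone
    return clusters
```

For example, on the path a → b → c → d the sketch clusters b with c and leaves a and d alone, whereas clustering {a, b} and {c, d} would leave only one edge between clusters.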
Conclusion
In this paper, we studied the problem of clustering combinatorial networks for delay minimization when logic replication is not allowed (CN X, M, ∆ ). We showed that several versions of CN W, M, ∆ are NP-hard. The strategy developed for the proofs allowed us to prove that the problem does not admit a (2 − ε)-approximation algorithm for any ε > 0, unless P=NP. On the positive side, there exists a 3-approximation algorithm for CN N, 2, ∆ .
We are interested in the following open problems:
1. Finding an approximation algorithm for CN N, 2, ∆ whose performance ratio is smaller than 3. There may exist a combinatorial approximation algorithm for CN N, 2, ∆ with smaller performance ratio. The following idea may be helpful in the design of such an algorithm. Take a longest path in the input DAG. Put the first two vertices in one cluster, the second two in another cluster, and so on. Remove all the vertices that are clustered with some other vertex. Iterate until all edges of the DAG are exhausted.
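The pairing idea above can be prototyped in a few lines. The sketch below is only one possible reading of that idea and carries no performance guarantee. It assumes the DAG is given as a vertex list and an edge list, that the last vertex of an odd-length path stays unclustered for later iterations, and that remaining vertices end up as singletons. The function name is hypothetical.

```python
def pair_along_longest_paths(vertices, edges):
    """Heuristic: pair consecutive vertices along a longest path, remove the
    paired vertices, and repeat until no edges remain."""
    vertices, edges = list(vertices), list(edges)
    clusters = []
    while edges:
        succ = {v: [] for v in vertices}
        for a, b in edges:
            succ[a].append(b)
        best = {}  # best[v] = a longest path (as a list) starting at v

        def walk(v):
            if v not in best:
                tails = [walk(w) for w in succ[v]]
                best[v] = [v] + (max(tails, key=len) if tails else [])
            return best[v]

        path = max((walk(v) for v in vertices), key=len)
        paired = set()
        # First two vertices, next two vertices, and so on.
        for i in range(0, len(path) - 1, 2):
            clusters.append({path[i], path[i + 1]})
            paired.update((path[i], path[i + 1]))
        vertices = [v for v in vertices if v not in paired]
        edges = [(a, b) for a, b in edges if a not in paired and b not in paired]
    clusters.extend({v} for v in vertices)  # unpaired vertices stay alone
    return clusters
```

On the path a → b → c → d this produces the clusters {a, b} and {c, d}. The recursive helper is adequate for small examples; a large DAG would need an iterative longest-path routine.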
