Some new results in the complexity of allocation and binding in data path synthesis  by Mandal, C.A. et al.
Pergamon 
Computers Math. Applic. Vol. 35, No. 10, pp. 93-105, 1998 
© 1998 Elsevier Science Ltd
Printed in Great Britain. All rights reserved 
0898-1221/98 $19.00 + 0.00 
PII: S0898-1221(98)00076-5 
Some New Results in the Complexity of Allocation and Binding in Data Path Synthesis
C. A. MANDAL 
Department of Computer Science, Brunel University 
Uxbridge UB8 3PH, U.K. 
P. P. CHAKRABARTI AND S. GHOSE
Department of Computer Science & Engineering 
Indian Institute of Technology 
Kharagpur 721302, India 
(Received May 1997; and accepted June 1997) 
Abstract--In this paper, we present some new results on the complexity of allocation and
binding problems in Data Path Synthesis (DPS). We have considered the port assignment problem for
multiport memories, the Register-Interconnect Optimization problem (RIO), and the problem of
formation of functional units. RIO is a major problem of DPS and we have examined several versions
of it. The simplest case that we have considered is Register Optimization (RO) for straight line code
which is solvable in polynomial time. The next more general case that we have considered is RIO
for straight-line code (SRIO), a special case of RIO, which we have shown to be NP-hard. The most
significant contributions of this work are results on the hardness of relative approximation of several
problems of DPS. We have shown that the constant bounded relative approximation of PA for triple
port memories and SRIO are both NP-hard. © 1998 Elsevier Science Ltd. All rights reserved.
Keywords--Allocation, NP-complete problems, Data path synthesis, High-level synthesis.
1. INTRODUCTION 
Allocation and binding problems which have to be addressed for data path synthesis include
(a) Functional Unit (FU) formation,
(b) interconnect formation,
(c) memory allocation, and
(d) register optimization.
Together these make up a major part of the allocation and binding problem whose complexity 
we examine in this paper. 
The overall problem of allocation and binding is NP-hard because some of its subproblems
are already known to be NP-hard. The NP-hardness of the register minimization problem
is one of the oldest complexity results in this area. In [1], the register minimization problem has
been formulated as a clique partitioning problem, which is a standard NP-complete problem. The
complexity of connectivity binding has been examined in [2] where results on two restricted cases 
have been derived. The first case is as follows. Given
(i) a set of registers R where |R| = ρ,
(ii) a set of operation types Φ,
(iii) a set of units U where |U| = μ, and each unit performs a subset of the operations in Φ,
(iv) a state graph G = (V, A) where |V| = ν and |A| = α,
(v) a state assignment fs : v → s of vertices to states in S where |S| = σ,
(vi) a binding of arcs in A that extend across state boundaries to registers in R, fr : a → r, and
(vii) an integer N.
The problem is stated as follows. Is there a mapping of the vertices in the state graph G to the
units in U, fu : v → u, such that each unit is used at most once in any state s ∈ S and that
the number of connections between registers and units is less than or equal to N? The second
problem instance analyzed is as follows. Given a state graph G = (V, A), a set of registers R,
a set of functional units U, a binding fu : v → u of vertices to units, and a positive integer N,
is there a binding fr : a → r of arcs in A to registers in R such that the number of connections
between registers and units is less than or equal to N? Both cases have been shown to be
NP-complete. The general decision problem has also been proved to be NP-complete.
In our study of the complexity of allocation and binding, we consider several new problems 
which are as follows. 
• The Port Assignment of multiport memories (PA). 
• Register-Interconnect Optimization (RIO). 
• The problem of Functional Unit Formation (FUF). 
In our study, we not only examine the complexity of optimally solving the above mentioned 
problems, but also seek to establish the complexity of finding an approximate solution wherever 
possible. 
In general, for an NP-hard optimization problem we do not expect to find a polynomial time
algorithm to solve that problem optimally. The only known methods of obtaining optimum 
solutions (like branch and bound techniques) are exponential in time complexity. Data Path 
Synthesis (DPS) problems which occur in practice are usually so complex that enumerative 
approaches like branch and bound are ruled out in many situations. The alternative approach 
is to relax the requirement of finding optimal cost solutions. Designers are often satisfied with 
a fast (polynomial time) algorithm, provided it guarantees some error bounds on the cost of the 
solution. Such algorithms are known as approximation algorithms. There are two well-known 
types of approximation algorithms, viz. the absolute approximation algorithm (which guarantees 
that the solution obtained will not differ from the optimal by more than a fixed constant) and 
the relative approximation algorithm (which guarantees that cost of the solution obtained by the 
algorithm will not exceed, in the case of minimization, the cost of the optimal by a constant 
factor). Absolute approximation algorithms are usually difficult to find for most NP-complete 
problems. The second type of approximation algorithms, namely those of relative approximation, 
are obtainable for many problems where an absolute approximation algorithm is not obtainable. 
Relative approximation algorithms are available for many scheduling problems. For example, list 
scheduling guarantees that the solution obtained in the case of a single operation DAG is never
more than twice the optimal schedule length. 
We shall establish that for nearly all the allocation problems that we consider, the problem 
of finding a relative approximation itself is NP-complete. Throughout this paper, we shall use 
the following notation: for any problem X (for which we require an optimal solution), X-R
shall denote the problem of finding a solution whose relative error is bounded by a constant.
X-R will also be referred to as a constant bounded relative approximation for X. When we say
that X-R is NP-complete, we mean that if there is a polynomial time approximate algorithm
which guarantees a constant relative error bound for X, then P = NP. In this paper, we show 
that not only are most allocation problems NP-complete, but also that constant bounded relative 
approximations of several versions of PA and RIO are NP-complete.
The organization of the paper is as follows. In Section 2, we introduce the general node deletion
problem. We show that this problem and its relative approximations are NP-hard. These
results form the basis of most of the results derived later in this paper. We introduce the port 
assignment problem in Section 3. These problems arise when multiport memories are used as 
building blocks in data path synthesis [3]. The problem of PA for triple port memories is examined
in Section 4. For triple port memories, we have proved that the port assignment problem and 
its absolute and relative approximations are all NP-hard. In Section 5, we review the register 
optimization problem and introduce the register optimization problem for straight line code, 
which is known to be solvable in polynomial time [4]. In Section 6, we generalize this problem 
to the register-interconnect optimization problem for straight line code (SRIO). This problem 
is naturally encountered in simple instances of DPS problems. Through this problem, we
prove that the interconnect optimization as well as its constant bounded relative approximation 
are NP-hard. The NP-hardness of SRIO is a significant result because the problem of register 
optimization for straight line code is solvable in polynomial time and the problem of general 
register-interconnect optimization is already known to be NP-hard. 
In Section 7, we examine a different problem, namely that of functional unit formation (FUF). 
The FUF problem comes up when the schedule of operations does not explicitly indicate the FU 
to which an operation should be mapped. This is often the case when specific FUs are not 
available at the time of scheduling. During allocation and binding, it is then necessary to find a 
suitable mapping of the operations to the FUs. This mapping determines the capabilities, and 
therefore, the cost of the FUs. We prove, in particular, that the problem of minimizing the cost 
of the FUs as well as its absolute approximation are NP-hard. 
2. GENERAL NODE DELETION
The node deletion problems have been used in this paper to establish the complexity of many of 
the subproblems encountered during allocation and binding. The node deletion problem (ND) [5] 
is as follows. Given a graph G(V, E), V being the set of vertices and E the set of edges, identify 
a set of nodes Nd for deletion such that the graph, after deletion of nodes in Nd, is bipartite. 
A graph is said to be bipartite if it can be partitioned into two independent sets. A graph is 
bipartite if and only if it can be coloured using two colours, i.e., it is 2-colourable. For the node 
deletion problem, Nd should be the smallest possible set of such vertices. The Node Deletion 
Decision problem (NDD) answers the question whether the given graph can be rendered bipartite 
by the deletion of m vertices, 0 ≤ m ≤ |V| - 2. It has been shown that both ND and NDD are
NP-complete [5]. 
The General Node Deletion problem (GND) is formulated as follows. Given a graph G(V, E), 
identify a set of nodes Nd for deletion such that the graph, after deletion of nodes in Nd, is 
k-colourable. Nd should be the smallest possible set of such vertices. The corresponding decision 
problem (GNDD) is to answer the question whether the given graph can be rendered k-colourable 
by the deletion of m vertices, 0 ≤ m ≤ |V| - k. Such a decision problem will be represented as
GNDD (m, k). 
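The definition of GND above can be illustrated with a small exhaustive search. This is only a sketch for tiny graphs, written for this exposition: the function names `is_k_colourable` and `solve_gnd` are hypothetical, and the running time is exponential, as would be expected for an NP-hard problem.

```python
from itertools import combinations, product

def is_k_colourable(vertices, edges, k):
    """Exhaustively test whether the graph admits a proper k-colouring."""
    vertices = list(vertices)
    for colours in product(range(k), repeat=len(vertices)):
        colour = dict(zip(vertices, colours))
        if all(colour[u] != colour[v] for u, v in edges):
            return True
    return False

def solve_gnd(vertices, edges, k):
    """Return a smallest set Nd whose deletion leaves a k-colourable graph."""
    vertices = list(vertices)
    for m in range(len(vertices) + 1):          # try the smallest sets first
        for deleted in combinations(vertices, m):
            kept = [v for v in vertices if v not in deleted]
            kept_edges = [(u, v) for u, v in edges if u in kept and v in kept]
            if is_k_colourable(kept, kept_edges, k):
                return set(deleted)

# The complete graph on four vertices (illustrative input).
k4_vertices = [0, 1, 2, 3]
k4_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
nd_for_three_colours = solve_gnd(k4_vertices, k4_edges, 3)
```

For the complete graph on four vertices, one deletion suffices for 3-colourability and two are needed for 2-colourability.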
THEOREM 1. The general node deletion decision problem (GNDD) is NP-complete.
PROOF. For 2-colourability GNDD is NP-hard because NDD is NP-hard. For k-colourability, 
k ≥ 3, GNDD is NP-hard because the corresponding chromatic decision problem is NP-hard. It
is easy to write a polynomial time nondeterministic algorithm [5] to solve GNDD. | 
The following corollary follows easily. 
COROLLARY 2. GND is NP-complete.
The question of polynomial time approximate algorithms for GND is now examined. An
approximation algorithm for GND with constant relative error bound would be one which
guarantees dr ≤ (1 + ε)d*, for any instance of GND, where d* is the number of vertices deleted by the
optimal algorithm, dr is the number of vertices deleted by the approximation algorithm, and
ε > 0 is a constant. We denote the problem of obtaining an approximation
algorithm for GND whose relative error is bounded by a constant as GND-R. This is also referred
to as the constant bounded relative approximation for GND. The following important result can
be proved.
THEOREM 3. A constant bounded relative approximation for GND (GND-R) is NP-complete. 
PROOF. Suppose that a polynomial time algorithm Ar exists, which guarantees that the relative
error in the approximate solutions to GND is bounded by a constant. Let G(V, E)
be an instance of GND, such that the graph is k-colourable. Therefore, d* = 0, and the bound
dr ≤ (1 + ε)d* requires that Ar report dr = 0. Conversely, if G is not k-colourable, then any
feasible solution deletes at least one vertex and Ar must report dr ≥ 1. Thus Ar could be used
to solve the chromatic decision problem (CDP) [5] in polynomial time. CDP being NP-complete,
such an algorithm Ar can exist only if P = NP. Thus,
GND-R must be NP-complete. |
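The argument of the proof can be sketched in code: any routine that honours the bound dr ≤ (1 + ε)d* answers the k-colourability question, because it must return 0 exactly when d* = 0. No polynomial time routine of this kind is known, so the sketch below plugs in an exact, exponential-time solver as a stand-in; `stand_in_approx` and `decide_k_colourable` are hypothetical names introduced only for illustration.

```python
from itertools import combinations, product

def k_colourable(vertices, edges, k):
    """Exhaustive check for a proper k-colouring (exponential time)."""
    return any(
        all(c[u] != c[v] for u, v in edges)
        for colours in product(range(k), repeat=len(vertices))
        for c in [dict(zip(vertices, colours))]
    )

def stand_in_approx(vertices, edges, k):
    """Stand-in for Ar: an exact GND solver, so dr = d* and the bound holds."""
    for m in range(len(vertices) + 1):
        for deleted in combinations(vertices, m):
            kept = [v for v in vertices if v not in deleted]
            kept_edges = [e for e in edges if e[0] in kept and e[1] in kept]
            if k_colourable(kept, kept_edges, k):
                return m

def decide_k_colourable(approx, vertices, edges, k):
    """Decide k-colourability from any algorithm with dr <= (1 + eps) * d*."""
    # d* = 0 forces dr = 0; d* >= 1 forces dr >= 1.
    return approx(vertices, edges, k) == 0

triangle = [(0, 1), (1, 2), (0, 2)]
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
```

The decision wrapper never inspects ε, which reflects the fact that the argument works for every constant bound.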
The particular case for GND to ensure that a given graph will be made three colourable will be 
referred to as GN3D. The following result regarding a constant bounded relative approximation 
for GN3D follows from Theorem 3, since the three colour decision problem is NP-hard. We shall 
use this result several times in the rest of this paper. 
COROLLARY 4. GND-R for 3-colourability, GN3D-R, is NP-complete.
3. PORT ASSIGNMENT
The variables that are used to specify a behaviour need to be implemented as storage elements.
These may be clustered into memory modules of one, two, or three ports. Some work on port
assignment has been reported in [6]. At this level of abstraction, it is permissible to view inputs
and outputs of components, as well as ports of memories, as single points in the circuit. A point
in a circuit is said to access a memory if it transfers data to or from a cell of the memory. Given a
set of registers being placed in the memory along with the permissible number of ports and their
capabilities, and the set of accesses to these registers, it is necessary to assign the accessing points
(in the circuit) to the memory ports so that all the accesses will be satisfied. The assignment should
be made in such a way that the resulting interconnect overhead is minimized. The assignment 
process is explained with a simple example. 
EXAMPLE 1. Consider the transfers given below.
1. a = b + c.
2. q = c + d, b = p - q.
3. d = a + r, c = p - r.
4. a = p + c, b = q - r.
Suppose a, b, c, d, p, q, r, and s are registers of which only a, b, c, and d are to be placed in 
the same memory. Assume that three ports are permitted, and the ports are labeled 0, 1, and 2, 
respectively. It will be noted that at most three accesses are made to the memory in any time 
step. Suppose that an adder and a subtracter are used. Let the adder inputs be labeled la and ra 
while the adder and subtracter outputs be, respectively, labeled oa and os. It will be observed 
that la, ra, oa, and os are the only four points accessing the memory in the various control steps. 
They need to be assigned to the ports suitably. Consider the assignment where la, ra, and oa
are mapped on ports 0, 1, and 2, respectively, and os is mapped to all the ports 0, 1, and 2. All 
the transfers can be satisfied using this assignment. The connections are illustrated in Figure 1. 
It will be noted that a total of six switches will be required at the ports of the memory shown. | 
Figure 1. Connections to a three port memory.
It will be noted that a point in the circuit which must access a member of the memory under 
consideration will be connected to at least one of its ports. To reduce interconnect overhead, port 
assignment (PA) should be done so that the minimum number of points is connected to more 
than one port. When a point that reads from the memory is connected to k (k > 1) ports of the
memory, each connection has to be switched through a multiplexer channel. Similar multiplexing 
is required when multiple sources are connected to a write port of a memory. For the port 
assignment suggested in Example 1, os is mapped to all three ports and so its connectivity is
three, as against the desired level of one.
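The assignment of Example 1 can be checked mechanically. In the sketch below, the per-step access sets are derived from the four transfers under the assumption that the left operand of an addition arrives at la and the right operand at ra; the helper `step_is_satisfiable` is a hypothetical name. A step is satisfiable when its accessing points can be given pairwise distinct ports from among those they are wired to.

```python
# Points accessing the memory {a, b, c, d} in each control step of Example 1
# (derived here; operand-to-input order is an assumption of this sketch).
steps = [
    {"la", "ra", "oa"},   # 1. a = b + c
    {"la", "ra", "os"},   # 2. q = c + d, b = p - q
    {"la", "oa", "os"},   # 3. d = a + r, c = p - r
    {"ra", "oa", "os"},   # 4. a = p + c, b = q - r
]
# The assignment suggested in Example 1: point -> set of ports it is wired to.
assignment = {"la": {0}, "ra": {1}, "oa": {2}, "os": {0, 1, 2}}

def step_is_satisfiable(points, assignment):
    """True if the points can use pairwise distinct ports they are wired to."""
    points = sorted(points, key=lambda p: len(assignment[p]))

    def place(i, used):
        if i == len(points):
            return True
        return any(place(i + 1, used | {port})
                   for port in assignment[points[i]] - used)

    return place(0, frozenset())

all_satisfied = all(step_is_satisfiable(s, assignment) for s in steps)
connections = sum(len(ports) for ports in assignment.values())
```

With this assignment every step is satisfiable and the total number of port connections is six, matching the switch count quoted in Example 1. Restricting os to a single port makes at least one step infeasible.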
An algorithm that finds an optimal PA should lead to an interconnection that requires the 
least number of switches for multiplexing. In this section, we shall deal with the PA problem for 
triple port memories. We shall first introduce the problem. We shall identify a special case of 
the corresponding PA problem to derive the complexity results. 
The PA problem may have several variations, depending on the number and type of the ports. 
The ports may be uniform, being read/write (rw) or purely read (r), or may have arbitrary
capability, i.e., rw, r, or w. PA for a single port is trivial. The case of three ports with uniform
capability is considered here. The complexity results for this case will help us to derive results
for the complexity of the SRIO problem.
4. MEMORIES WITH THREE UNIFORM PORTS
Given a set of registers which have been placed in a particular memory with three uniform 
ports, it is necessary to assign the circuit points accessing the memory such that 
(a) all the accesses in each control step are satisfied; and 
(b) the cost of switches for interconnection to the ports is minimized. 
This problem will be referred to as PA3U. First, a relaxed formulation PA3UA1 based on GND 
is presented in Section 4.1. PA3UA1 will be proved to be as hard as GND. PA3UA1 will then be 
used to derive complexity results for PA3U. 
4.1. PA3UA1 and Its Relationship with GN3D
We now consider a relaxed formulation of PA3U, to be referred to as PA3UA1, which is as
follows. The accesses to the memory in question are only read accesses, the points that read
from the memory do not read from or write to any other place in the circuit, and these points are
connected to exactly one port or to all three ports of the memory. The first condition implies that
multiplexers are required only at the circuit points and not at the ports. The second assumption
ensures that the only lines incident at these circuit points are those coming from the memory
ports. If a point is connected to a single port then no multiplexing is needed. If, on the other
hand, the point is connected to k (k > 1) ports then a k-to-1 multiplexer will be needed. Such
a multiplexer is essentially a set of k switches which operate in a mutually exclusive manner.
The interconnect cost for a k-to-1 multiplexer is then taken as k. The third assumption ensures
that either k = 1 or k = 3. If k = 1, then the interconnect cost is zero and if k = 3, then the cost is
three. This special case will help us to use the complexity results of GN3D.
Given an instance of PA3UA1, we make the following construction. A graph is constructed
from the set of transfers for the various control steps, with one vertex for each point accessing
the memory. If points p1 and p2 access the memory in
the same control step, then they must access the memory through distinct ports. In the graph an
edge is introduced between the vertices corresponding to p1 and p2. If this graph is 3-colourable,
then a feasible assignment of the points to the ports can be directly obtained. Otherwise, it 
will be necessary to connect some of the points to more than one port, to satisfy the memory 
accesses. Since we are working on an instance of PA3UA1, we shall connect such a point to 
all three ports of the memory. All the conflicts for such a point can always be resolved, and 
the vertex corresponding to this point may be deleted from the conflict graph. To minimize the 
interconnect cost, the number of such points should be minimized. Clearly this corresponds to
the general node deletion problem to achieve three colourability (GN3D). 
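The construction of the conflict graph can be sketched as follows, using the access sets of Example 1 as input (these sets are derived here for illustration and include write accesses, so they go slightly beyond the read-only setting of PA3UA1). The names `conflict_graph` and `three_colouring` are hypothetical, and the colouring test is exhaustive, so it is suitable only for very small instances.

```python
from itertools import combinations, product

def conflict_graph(steps):
    """One vertex per accessing point; an edge joins points sharing a step."""
    vertices = sorted(set().union(*steps))
    edges = {tuple(sorted(pair)) for s in steps for pair in combinations(s, 2)}
    return vertices, sorted(edges)

def three_colouring(vertices, edges):
    """Return a port (colour) per point, or None if no 3-colouring exists."""
    for colours in product(range(3), repeat=len(vertices)):
        colour = dict(zip(vertices, colours))
        if all(colour[u] != colour[v] for u, v in edges):
            return colour
    return None

steps = [{"la", "ra", "oa"}, {"la", "ra", "os"},
         {"la", "oa", "os"}, {"ra", "oa", "os"}]
vertices, edges = conflict_graph(steps)

# Deleting os corresponds to wiring os to all three ports.
reduced_steps = [s - {"os"} for s in steps]
reduced_vertices, reduced_edges = conflict_graph(reduced_steps)
```

For these access sets the conflict graph is complete on four vertices, so no 3-colouring exists until one vertex is deleted, which agrees with the assignment of Example 1 in which os is connected to all three ports.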
4.2. Complexity of PA3UA1
It will be shown in this section that PA3UA1 is just as hard as GN3D. Thus, an
approximation for PA3UA1 whose relative error is bounded by a fixed constant is NP-complete. The
transformation described previously leads to the following theorem. 
THEOREM 5. The relaxed formulation, PA3UA1, of the port assignment problem of three port
memories is NP-complete.
PROOF. Given a graph G(V, E) for GN3D, an instance of PA3UA1 will be constructed as follows.
Let r1, r2, and r3 be registers packed into the given memory. For each vertex v_i ∈ V, let p_i be a
point in the circuit accessing the memory. For each edge (v_i1, v_i2) ∈ E, find a vertex v_i3 ∈ V, if
such a vertex exists, such that (v_i1, v_i3) ∈ E and (v_i2, v_i3) ∈ E. If v_i3 exists, construct the control
step p_i1 ← r1, p_i2 ← r2, p_i3 ← r3, otherwise construct the control step p_i1 ← r1, p_i2 ← r2.
This creates an instance of the port assignment problem with three ports. It is easy to see 
that the deletion of the vertices whose corresponding points are chosen for connection to all three 
ports for conflict resolution will be a feasible solution to GND. This is because the remaining 
points are connected to single ports and their corresponding vertices may therefore be assigned 
three distinct colours. 
It is easy to see that PA3UA1 is in NP. | 
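The construction used in the proof can be sketched as a short routine that emits one read-only control step per edge. The function name `pa3ua1_instance` and the string labels for points and registers are hypothetical; when several common neighbours exist, this sketch simply picks the smallest one.

```python
def pa3ua1_instance(vertices, edges):
    """Build one read-only control step per edge of the GN3D graph."""
    adjacent = {v: set() for v in vertices}
    for u, v in edges:
        adjacent[u].add(v)
        adjacent[v].add(u)
    steps = []
    for u, v in edges:
        common = sorted(adjacent[u] & adjacent[v])
        readers = [u, v] + common[:1]     # third reader only if a triangle exists
        # Point p<x> reads register r1, r2 or r3 in this control step.
        steps.append([(f"p{x}", f"r{j}") for j, x in enumerate(readers, start=1)])
    return steps

# A triangle 0-1-2 with a pendant vertex 3 (illustrative input).
example_steps = pa3ua1_instance([0, 1, 2, 3], [(0, 1), (1, 2), (0, 2), (2, 3)])
```

Every pair of points joined by an edge appears together in some control step, so the conflict graph of the generated instance contains the original graph.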
Now, a relative approximate algorithm for PA3UA1 may also be used as an approximate 
algorithm for GN3D with the same error. Such an approximation for PA3UA1 will be referred 
to as PA3UA1-R. Therefore, from Corollary 4, the following corollary follows.
COROLLARY 6. An approximation for PA3UA1, PA3UA1-R, whose relative error is bounded by
a fixed constant is NP-complete. 
Now the complexity of the original three port problem, viz. PA3U, will be analyzed and it will 
be shown that it is also as difficult as PA3UA1. 
4.3. Complexity of PA3U
We first consider a special case of PA3U, PA3U1, which is as follows. The accesses to the
memory in question are only read accesses, the points that read from the memory do not read 
from or write to any other place in the circuit. We note that PA3U1 and PA3UA1 differ only 
in the way the points may be connected to the ports of the memory. In the case of PA3UA1, a 
point may be connected to either one or all of the ports. Thus, either no multiplexer is needed 
or a 3-to-1 multiplexer is needed. However, for PA3U1, the point may be connected to one, two 
or all three of the ports. In this case either no multiplexer is required or a 2-to-1 or a 3-to-1 
multiplexer is required at that point. We prove the following theorems for PA3U1. These results 
directly carry over to PA3U, it being a generalization of PA3U1.
THEOREM 7. PA3U1 is NP-complete. 
PROOF. Proved by reducing PA3UA1-R to PA3U1. Let Y* be the cost of the optimal solution to
PA3U1 and X*, the cost of the optimal solution to PA3UA1. Clearly, Y* ≤ X*. Let an optimal
solution to PA3U1 consist of p2* points connected to two ports and p3* points connected to three ports. Thus,
Y* = 2p2* + 3p3* ≥ 2p*, where p* = p2* + p3*. Thus, p* ≤ (1/2)Y*.
The p* points of an optimal solution to PA3U1 may be connected to all three ports, so that
this modified solution serves as an approximate solution to PA3UA1. Let X be the cost of this
approximate solution to PA3UA1. X = 3p* ≤ (3/2)Y* ≤ (3/2)X*.
Thus the optimal algorithm for PA3U1 could be used to solve PA3UA1-R, proving that PA3U1
is NP-hard. It is easy to show that PA3U1 is in NP. |
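The cost bound used in the proof can be checked numerically. The sketch below represents a solution as a mapping from each point to the number of ports it is connected to; `cost` and `promote_to_pa3ua1` are hypothetical helper names, and the sample solution is invented purely for illustration.

```python
from itertools import product

def cost(port_counts):
    """A point on k ports costs k switches when k > 1 and nothing when k = 1."""
    return sum(k for k in port_counts.values() if k > 1)

def promote_to_pa3ua1(port_counts):
    """Connect every multi-port point to all three ports."""
    return {p: (1 if k == 1 else 3) for p, k in port_counts.items()}

# Invented PA3U1 solution: two points on two ports, one point on three ports.
solution = {"p1": 1, "p2": 2, "p3": 2, "p4": 3}
y = cost(solution)                       # 2 + 2 + 3
x = cost(promote_to_pa3ua1(solution))    # 3 + 3 + 3

# The bound X <= (3/2) Y, checked for every solution on four points.
bound_holds = all(
    2 * cost(promote_to_pa3ua1(dict(enumerate(ks)))) <= 3 * cost(dict(enumerate(ks)))
    for ks in product([1, 2, 3], repeat=4)
)
```

The bound is tight when every multi-port point uses exactly two ports, since promoting such a point raises its cost from two to three.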
COROLLARY 8. PA3U is NP-complete. 
PROOF. NP-hardness of PA3U follows from Theorem 7 because PA3U1 is a special case of PA3U.
That it is in NP can be shown easily. | 
THEOREM 9. An approximation for PA3U1, PA3U1-R, whose relative error is bounded by a fixed
constant is NP-complete. 
PROOF. Let Y be the cost obtained by an algorithm for PA3U1-R, and let Y* be the cost of the
optimal solution to PA3U1, so that Y ≤ kY* for some constant k. Let the suboptimal solution to
PA3U1 consist of p2 points connected to two ports and p3 points connected to three ports. Thus,
Y = 2p2 + 3p3 ≥ 2p, where p = p2 + p3. Thus, p ≤ (1/2)Y. Along the lines of Theorem 7, this
solution could also be used as a solution for PA3UA1-R. Let X be the cost of this solution to
PA3UA1-R. X = 3p ≤ (3/2)Y ≤ (3k/2)Y* ≤ (3k/2)X*, where X* is the cost of the optimal
solution to PA3UA1. |
COROLLARY 10. An approximation for PA3U, PA3U-R, whose relative error is bounded by a 
fixed constant is NP-complete. 
PROOF. Follows from Theorem 9, as PA3U1-R is a special case of PA3U-R. | 
This completes our treatment of the port assignment problem for triple port memories. We 
shall use the results derived here in the subsequent sections of this paper. We now go over to 
register optimization and register-interconnect optimization problems. 
5. REGISTER OPTIMIZATION
The aim of register optimization (RO) is to minimize the number of registers needed in the 
design [4]. Registers need to be used to store values between control steps. In the context of data 
path synthesis, registers are needed to implement the variables used to describe the behaviour 
of the target system. In addition to the variables declared by the designer, some variables may 
be used at the time of generating intermediate code. All variables in the final implementation 
need to be mapped onto registers. It may be possible to map some of these registers onto on-chip 
memories. Registers and variables will be used in an interchangeable manner. 
A variable is live from the time when it is first defined till the time that value is last used. 
A variable may become live several times during the execution of the program. Two variables 
that are never live at the same time may be merged and placed on the same register without 
affecting the logical input/output behaviour of the program. It is necessary to determine the life 
times of each variable in a program. This is called live variable analysis [7]. Once the life times 
of the variables are known, it is necessary to represent their sharability and perform register 
minimization. 
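The sharability test described above can be sketched directly: two variables may share a register when the sets of control steps in which they are live do not intersect. The function name `compatible_pairs` and the sample live sets are hypothetical; a real flow would obtain the live sets from live variable analysis.

```python
from itertools import combinations

def compatible_pairs(live_steps):
    """Pairs of variables that are never live in the same control step."""
    return sorted(
        (u, v) for u, v in combinations(sorted(live_steps), 2)
        if not (live_steps[u] & live_steps[v])
    )

# Invented live sets; w becomes live twice (steps 1 and 4).
live_steps = {"x": {1, 2}, "y": {3, 4}, "z": {2, 3}, "w": {1, 4}}
pairs = compatible_pairs(live_steps)
```

Representing a lifetime as a set of steps, rather than a single interval, accommodates variables that become live several times.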
5.1. RO for Straight Line Code: A Solved Problem
We call a patch of code that contains neither branching instructions nor targets of branching 
instructions, a straight line code. Such code may be represented by a single DAG. The register 
optimization for this case will be referred to as SRO. In practice, some behaviours encountered are 
operation intensive and do not contain any decision making branches or looping constructs at all. 
Sometimes loops with a fixed number of iterations may be unrolled to remove the iteration. The
intermediate code of such behaviours takes a particularly simple form, consisting of only arithmetic
or logical operations. For this kind of code, it turns out that the complement of the vertex
compatibility graph of the variables in the DAG is an interval graph. The complemented vertex 
compatibility graph (CVCG) in this case being an interval graph may be optimally coloured, 
using the left edge algorithm [4], in polynomial time, V being the set of vertices in the CVCG. 
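A minimal sketch of register allocation for interval lifetimes follows. It is a first-fit variant of the left edge idea (variables are taken in order of their left edges and placed in the first register that is free), not necessarily the exact procedure of [4]. Lifetimes are assumed to be inclusive (first step, last step) pairs, so two variables sharing a step conflict; the name `left_edge` and the sample data are illustrative only.

```python
def left_edge(lifetimes):
    """Assign variables with interval lifetimes to as few registers as possible.

    lifetimes maps a variable to (first_step, last_step), both inclusive.
    """
    order = sorted(lifetimes, key=lambda v: lifetimes[v])   # by left edge
    registers = []            # each register holds variables in time order
    for var in order:
        start, _ = lifetimes[var]
        for reg in registers:
            if lifetimes[reg[-1]][1] < start:   # register is free again
                reg.append(var)
                break
        else:
            registers.append([var])
    return registers

# Invented lifetimes; at most two variables are live in any step.
lifetimes = {"a": (1, 3), "b": (2, 4), "c": (4, 6), "d": (5, 7), "e": (1, 1)}
allocation = left_edge(lifetimes)
```

When a variable opens a new register, every existing register holds a variable that is still live at that step, so the number of registers used equals the maximum number of simultaneously live variables.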
5.2. General RO 
In general, however, the sharability will not be as simple as that for SRO and may be repre- 
sented by a graph where there is a vertex for each variable. Two vertices are joined by an edge 
if the lifetimes of the corresponding variables are disjoint. This graph will be called the Vertex 
Compatibility Graph (VCG). The problem of register minimization may now be mapped to the 
Clique Partitioning problem (CP), which is to find the minimum number of disjoint cliques that 
cover a graph. Each clique in the VCG corresponds to a set of variables to be mapped on a single 
register. This general case of RO will be called GRO.  The following result can be easily proved. 
THEOREM 11. GRO is NP-complete. 
A comparative study of the complexity of the Traveling Salesperson Problem, Clique, Colour- 
ing, and Bin Packing has been made in [8]. It has been shown in [9] that no polynomial time 
approximate algorithm is currently known for graph colouring for which the bound on the relative 
error is even close to oo. A very recent result in [10] states that, for the colouring problem there 
is a constant ε > 0 such that no polynomial time approximation algorithm can achieve a ratio of n^ε
(to the optimal) unless P = NP. This leads to the following result. 
THEOREM 12. The relative approximation of GRO, GRO-R is NP-hard. 
In the next section, we consider the problem of simultaneously optimizing the register and the 
multiplexer cost. We call this register-interconnect optimization (RIO). RIO based on flexible 
variable binding could be applied to intermediate variables, by breaking up a single contiguous 
life time. This would not lead to a reduction in the number of variables, but could help to reduce 
interconnection overhead.
6. REGISTER-INTERCONNECT OPTIMIZATION
Pure RO permits the minimization of registers. However, it has been seen in several design 
examples that pure RO leads to inferior designs, with excessively high interconnect cost. The
interconnect cost may be estimated by counting the total number of multiplexer channels needed
at the inputs of the hardware elements used in the circuit. RO performed along with interconnect
optimization is called register-interconnect optimization (RIO). The formulation of RIO is similar 
to RO and is briefly presented below. 
The register sharability information is identical to that for RO. In addition, a description of the
logical netlist is also presented to evaluate the effect of merging two registers on the interconnect 
cost. For a particular merger, the change in multiplexer cost may be computed. The netlist 
needs to be updated at each step to indicate the merger of the two registers. The net effect of 
the register mergers at a certain time may be computed from the updated netlist of the design. 
The RIO problem may be formulated as that of finding a set of mergers of registers, such that a
weighted sum of the register and the multiplexer cost is minimized. Thus, the objective function 
to be minimized may be taken as 
C = w_1 n_r C_R + w_2 n_m C_M, (1)
where C_R is the register cost and C_M is the unit multiplexer switch cost; n_r and n_m are the
numbers of registers and multiplexer channels, respectively. The cost of a register is proportional
to the number of bits in the register. The cost of a multiplexer is proportional to the number
of lines being multiplexed and the width of its output. If w_1 and w_2 are both taken as 1 in
equation (1), then the total cost for the registers and the multiplexer channels will be minimized.
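The objective of equation (1) can be evaluated with a few lines of code. In this sketch the netlist is assumed to be a list of (destination, source) transfers, and a destination fed by k > 1 distinct sources is charged k multiplexer channels, following the k-to-1 multiplexer cost used earlier for port assignment. The names `mux_channels` and `rio_cost`, and the numeric values, are hypothetical.

```python
def mux_channels(transfers):
    """Count multiplexer channels: a destination with k > 1 sources needs k."""
    sources = {}
    for dst, src in transfers:
        sources.setdefault(dst, set()).add(src)
    return sum(len(s) for s in sources.values() if len(s) > 1)

def rio_cost(n_r, n_m, c_r, c_m, w1=1, w2=1):
    """Weighted cost of equation (1): w1 * n_r * C_R + w2 * n_m * C_M."""
    return w1 * n_r * c_r + w2 * n_m * c_m

# Invented netlist: input la is fed by two registers, ra by one.
netlist = [("la", "r1"), ("la", "r2"), ("ra", "r3")]
n_m = mux_channels(netlist)
total = rio_cost(n_r=3, n_m=n_m, c_r=8, c_m=8)
```

Recomputing `mux_channels` on the updated netlist after each candidate merger gives the change in multiplexer cost for that merger.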
If we consider the RIO problem for a special class of circuits where n_m in equation (1) will be
zero, we immediately get a reduction of RIO to RO. This directly leads to the following result.
THEOREM 13. RIO is NP-complete.
As for RO, it is possible to distinguish between general RIO (GRIO) and RIO for straight line 
code (SRIO). RIO being NP-complete, GRIO is also NP-complete. It will be shown in Section 6.2 
that SRIO is also NP-complete. Thus, even though SRO is solvable optimally in polynomial time, 
using the left edge algorithm, SRIO is not. In fact, even approximation of SRIO is hard. 
6.1. Reduction of PA3U to SRIO
This section presents a reduction of the port assignment problem for three ports, PA3U, to 
RIO for straight line code, SRIO, and, therefore, to RIO also. The PA3U case consists of a set of 
transfers between the memory and the points accessing the memory in the various control steps. 
Each such transfer must take place through a port of the memory. Since only three ports are 
permitted, in any control step no more than three simultaneous accesses to the memory may be 
allowed. Since all the ports are uniform, it is permissible to map a transfer on any of the three 
ports. The mapping is done in two phases. First, a mapping is made to SRO. This mapping does 
not ensure optimality. It is only a prelude to a mapping to SRIO which does ensure obtaining 
an optimal solution. 
The mapping is explained with the help of Example 2. In this example, each access to the
memory is denoted as a_it, for the i-th access in time step t. While constructing an instance of SRO
for the PA problem, each such access plays the role of a register. Thus, each a_it is mapped to a
register r_l, such that the mapping (i, t) → l is unique. It is evident from the construction that
the lifetime of the register r_l is the time step t in which the access a_it takes place. The register
lifetimes form an interval graph, as expected. This is how the mapping from an instance of
triple port memory PA to SRO is achieved.
The total number of registers live in any time step will not exceed three, which is the maximum 
number of accesses to the memory in any time step. It is, therefore, ensured that there exists 
a nonempty set of solutions to the SRO problem requiring not more than three registers. These
are the solutions that we shall consider. Such a solution will always be obtained by running 
the left edge algorithm. Each register of the solution to the SRO problem instance actually 
corresponds to a grouping of memory accesses in different ime steps. This can, indeed, be 
considered an assignment if this transfers to one of the ports of the memory. Since the solutions 
considered contain no more than three ports, each solution may be used as a feasible solution to 
the PA problem. 
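The construction above can be made concrete with a minimal sketch of the left edge algorithm, given here in Python purely for illustration. The function name left_edge and the access labels a11, ..., a34 are assumptions of this sketch and not notation taken from the paper. Each access is given the unit lifetime [t, t] described in the text, and each resulting register is read as one memory port.

```python
from typing import Dict, List, Tuple

def left_edge(lifetimes: Dict[str, Tuple[int, int]]) -> List[List[str]]:
    """Pack closed lifetimes [start, end] into registers, one track at a time."""
    # Sort by left edge (start time); ties keep their insertion order.
    remaining = sorted(lifetimes.items(), key=lambda kv: kv[1])
    registers: List[List[str]] = []
    while remaining:
        track: List[str] = []   # variables sharing the current register
        leftover = []           # lifetimes that overlap something in this track
        busy_until = None       # end of the last lifetime placed in the track
        for name, (start, end) in remaining:
            if busy_until is None or start > busy_until:
                track.append(name)
                busy_until = end
            else:
                leftover.append((name, (start, end)))
        registers.append(track)
        remaining = leftover
    return registers

# Hypothetical labels: access i in time step t lives only during step t.
lifetimes = {f"a{i}{t}": (t, t) for t in range(1, 5) for i in range(1, 4)}
ports = left_edge(lifetimes)
```

For the twelve accesses of Example 2 the sketch yields three registers, each holding one access per time step. This corresponds to a feasible port assignment, with no attempt to reduce the interconnect cost.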
EXAMPLE 2. The creation of the problem instance for SRO and SRIO from a triple port memory 
PA problem of Example 1 is now explained. Shown below are the transfers, the corresponding 
accesses, and the variables for the SRO and SRIO instance. The variables a, b, c, and d are 
packed into a memory for which the PA needs to be done. The points l_a, ... are as in Example 1
(refer to Figure 1).
Transfers                     Accesses                  Registers
1. a = b + c                  a_{11}, a_{21}, a_{31}    r_1, r_2, r_3
2. q = c + d, b = p - q       a_{12}, a_{22}, a_{32}    r_4, r_5, r_6
3. d = a + r, c = p - r       a_{13}, a_{23}, a_{33}    r_7, r_8, r_9
4. a = p + c, b = q - r       a_{14}, a_{24}, a_{34}    r_{10}, r_{11}, r_{12}
The transfers for the SRIO instance which are needed to construct the netlist are shown below. 
1. r_1 ← o_a, l_a ← r_2, r_a ← r_3.
2. l_a ← r_4, r_a ← r_5, r_6 ← o_s.
3. r_7 ← o_a, l_a ← r_8, r_9 ← o_s.
4. r_{10} ← o_a, r_{11} ← o_s, r_{12} ← o_s.
The above mapping to SRO does not attempt to restrict the multiplexer usage and so the
interconnect cost may be expected to be high. An optimal assignment may be obtained by
mapping the problem to SRIO. The register lifetimes are the same as for SRO. Only the interconnect
information needs to be incorporated. The method of extraction of interconnection information
from the transfers, which has already been illustrated in Example 2, is as follows. Each point accessing
the memory now plays the role of a point in a circuit in the SRIO instance. For a transfer between
a point p and the memory in access a_{it} in the PA example, a transfer between register r_l and
point p is created in the SRIO instance.
We now complete the mapping to SRIO. In order to ensure that the solution will not require 
more than three ports, we fix the weights in equation (1) suitably. Suppose that the total number 
of circuit points accessing the memory is m. In the worst case, each point will be connected to 
all the ports and the number of multiplexer lines needed will be 3m. This is an upper bound on 
the number of multiplexer lines needed for the port assignment. In equation (1), let w_1 be set
to (3m + 2) and w_2 be set to 1. Any feasible solution will have cost c ≤ 12m + 6. It is assumed
that C_R and C_M have each been set to 1.
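The choice of weights can be checked by a short calculation. The sketch below assumes that equation (1) has the weighted form c = w_1 C_R N_R + w_2 C_M N_M, where N_R (the number of registers) and N_M (the number of multiplexer lines) are symbols introduced here only for illustration. The exact form of equation (1) should be taken from its definition earlier in the paper.

```latex
\begin{align*}
\text{three registers:}\quad c &\le w_1 C_R \cdot 3 + w_2 C_M \cdot 3m
                                = 3(3m + 2) + 3m = 12m + 6, \\
\text{four registers:}\quad  c &\ge w_1 C_R \cdot 4
                                = 4(3m + 2) = 12m + 8 > 12m + 6 .
\end{align*}
```

Under this assumption, any solution that uses a fourth register costs more than every three-register solution, so an optimal solution to the SRIO instance never uses more than three registers, i.e., three ports.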
6.2. Complexity of RIO 
It was shown in Section 6 that RIO is NP-complete. It will now be shown that an approximation 
for RIO, RIO-R, whose relative error is bounded by a fixed constant is NP-complete. This will
be proved through a sequence of reductions. First, PA3UA1-R is reduced to SRIO-R. SRIO-R 
is directly reducible to RIO-R and SRIO. This way SRIO-R and RIO-R will be proved to be 
NP-hard. 
Consider an approximation for SRIO, SRIO-R, whose relative error is bounded by a constant.
It has been shown that PA3UA1-R is NP-complete. It will be shown that the approximation 
PA3UA1-R is reducible to the approximation SRIO-R. 
THEOREM 14. SRIO-R is NP-complete.
PROOF. Given an instance of PA3UA1-R, which is essentially a PA problem for triple port 
memory, consider the following. Let A* be an optimal algorithm for PA3UA1. Let X* be the 
cost of the solution obtained by A*, i.e., the multiplexer lines needed as a result of connecting 
some of the circuit points to all three ports. 
Let B* be an optimal algorithm for SRIO. Let B be an algorithm for SRIO-R. Let Y* be the 
effective cost of the solution obtained by B*, i.e., total number of multiplexer lines used for the 
port assignment. Let Y be the cost of the solution obtained by B, so that Y ≤ k_2 Y* for a fixed constant k_2.
Y* ≤ X*, since the mapping of Section 6.1 ensures that B* yields an optimal solution to
PA3U1 and because PA3UA1 is a suboptimal formulation for PA3U1.
Suppose that the solution obtained by B requires p_2 two-to-one multiplexers and p_3 three-to-one
multiplexers. Then, Y = 2p_2 + 3p_3 ⇒ p_2 + p_3 ≤ (1/2)Y ⇒ p_2 + p_3 ≤ (k_2/2)Y* ≤ (k_2/2)X*.
The sum (p_2 + p_3) is the total number of points which are connected to more than one port
by B. A suboptimal solution to PA3UA1 may be obtained by connecting these points to all three
ports of the memory. The cost of this solution would be 3(p_2 + p_3) = X. Thus, X = 3(p_2 + p_3) ≤
(3k_2/2)X* < k_1 X*.
But the error in the suboptimal solution obtained by the above method for PA3UA1 is bounded
by a fixed constant. The above method may therefore be used as an algorithm for PA3UA1-R.
Since PA3UA1-R is NP-complete, it follows that SRIO-R is NP-hard.
It may be easily proved that SRIO-R is in NP. ∎
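The conversion step used in the proof can be illustrated with a small sketch. It takes an SRIO-style solution, described only by the set of ports each circuit point is connected to, and widens every point that uses more than one port to all three ports. The point names, the port numbering, and the example connection sets are hypothetical; the cost model (a point tied to j > 1 ports contributes j multiplexer lines) follows the expression Y = 2p_2 + 3p_3 used in the proof.

```python
from typing import Dict, Set

PORTS = frozenset({1, 2, 3})  # the three uniform ports of the memory

def mux_lines(conn: Dict[str, Set[int]]) -> int:
    """Count multiplexer lines: a point tied to j > 1 ports needs a j-to-one multiplexer."""
    return sum(len(ports) for ports in conn.values() if len(ports) > 1)

def widen(conn: Dict[str, Set[int]]) -> Dict[str, Set[int]]:
    """Connect every point that uses more than one port to all three ports."""
    return {
        point: set(PORTS) if len(ports) > 1 else set(ports)
        for point, ports in conn.items()
    }

# Hypothetical solution returned by an approximation algorithm for SRIO.
srio_solution = {"l_a": {1, 2}, "r_a": {3}, "o_a": {1, 2, 3}, "o_s": {2, 3}}

Y = mux_lines(srio_solution)         # 2*p2 + 3*p3 with p2 = 2 and p3 = 1
X = mux_lines(widen(srio_solution))  # 3*(p2 + p3)
```

In this instance Y = 7 and X = 9, which respects the bound X ≤ (3/2)Y used in the proof.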
The following three corollaries follow from Theorem 14. 
COROLLARY 15. SRIO is NP-complete. 
COROLLARY 16. An approximation for RIO, RIO-R, for which the relative error in the solution 
is bounded by a fixed constant is NP-complete. 
From Corollary 16, also, it follows that RIO is NP-hard. 
7. THE PROBLEM OF FORMING FUNCTIONAL UNITS 
We have explained earlier that the design of the data paths starts with the scheduling of the 
DAGs and is followed by allocation and binding. We consider a practical situation where the 
design has to be performed, subject to a constraint that a specified number of functional units has
to be used. This constraint leads to the requirement that no more than this number of operations 
should be scheduled in any time step. This is a small departure from a prevalent approach to the 
scheduling problem, where the sum total of the cost of hardware operators is to be minimized
without consideration for the number of units, where these operators are housed. 
With both the approaches, the objective is to minimize the cost of hardware operators. Sched- 
uling algorithms for the prevailing approach seek to minimize the maximum number of operations 
of each kind that will be performed in any time step. The actual formation of the FUs is usually 
done by binding the operations to the functional units (FU) during the allocation and binding 
step. This process is illustrated and clarified in Example 3. 
EXAMPLE 3. Consider the schedule of operations in Example 1. The scheduling algorithm would 
report that a maximum of one addition and one subtraction are performed in each time step. 
Thus, a minimum of one adder and one subtracter needs to be allocated. The scheduling has 
been done so that a maximum of two operations are scheduled per time step, so that two FUs 
would suffice to actually accommodate each operation in each time step. All the additions could 
be mapped onto one of the FUs and all the subtractions could be mapped onto the other. The 
resulting FUs would be as shown in Figure 1. ∎
EXAMPLE 4. Now consider the following transfers. 
1. q = c + d, b = p - q.
2. d = a + r, c = p & r.
3. a = p & c, b = q - r.
As in the case of Example 3, a maximum of two operations are scheduled per time step and so 
two FUs would suffice to actually accommodate the operations in each time step. The sched- 
uling algorithm would report a maximum of one adder, one subtracter, and one word-and-gate. 
However, an inspection will reveal that an assignment of these operations to the FUs such that 
at most one FU implements one operation is not possible. At least one of the operations would 
have to be assigned to two FUs in order to satisfy the specified schedule. Thus, we see that the 
allocation of one operation of each kind which is implied by the schedule, cannot be satisfied in 
this case. A set of FUs resulting from such an assignment would be (+, -) and (+, &). ∎
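The claim of Example 4 can be checked mechanically. The following sketch, in which the function name realizable and the encoding of operations as strings are assumptions made for illustration, tests whether every time step of a schedule can be placed on distinct FUs that implement the required operators. It confirms that the FU set (+, -) and (+, &) realizes the schedule, and that no split of one operator of each kind over two FUs does.

```python
from itertools import permutations
from typing import List, Set

def realizable(schedule: List[List[str]], fus: List[Set[str]]) -> bool:
    """Return True if every time step can be executed on the given FUs.

    Each operation of a step must be placed on a distinct FU that
    implements an operator of that type.
    """
    for ops in schedule:
        # Try every way of choosing distinct FUs for the operations of this step.
        placements = permutations(range(len(fus)), len(ops))
        if not any(all(op in fus[f] for op, f in zip(ops, chosen))
                   for chosen in placements):
            return False
    return True

# Operation types per time step in Example 4.
schedule = [["+", "-"], ["+", "&"], ["&", "-"]]

# The FU set given in the text, with "+" present in both units.
shared_plus = realizable(schedule, [{"+", "-"}, {"+", "&"}])

# Every way of splitting one operator of each kind over two FUs.
kinds = ["+", "-", "&"]
splits = [
    [{k for i, k in enumerate(kinds) if mask >> i & 1},
     {k for i, k in enumerate(kinds) if not mask >> i & 1}]
    for mask in range(2 ** len(kinds))
]
any_split_works = any(realizable(schedule, fus) for fus in splits)
```

The exhaustive search over placements is adequate only for small illustrations of this kind.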
Example 4 actually reveals a severe difficulty faced while performing scheduling. While
scheduling, it is desirable to find a schedule which is realizable using a set of FUs of minimum cost. 
The usual method of estimating the cost of the operator set that would be required to actually 
realize the operations in the schedule is to sum up the costs corresponding to the maximum 
number of operations of each kind. As it has been indicated in Example 4, this method is 
inadequate, as in the final assignment of operations to the FUs some more operations of each kind
could be required, thus violating the allocation implied by the schedule. 
The decision problem of forming FUs from the allocation implied by the schedule is formally 
defined as follows. We are given a schedule of operations, each of a single time step. The 
schedule consists of p, p > 0, types of single time cycle operations. At most n, n > 3, operations
are scheduled in each time step. The maximum number of operations of type i in any time step
of the schedule is m_i, m_i ≤ n. The problem is to determine whether an assignment exists of the
operations in each time step to n FUs, such that no two operations in the same time step are
mapped to the same FU and no more than m_i of the FUs implement operation of type i. We
shall refer to this problem as FUFD.
We would like to determine the complexity of the above problem. To do this, we consider the
special case of the problem when there is at most one operation of a particular kind scheduled in any
time step, i.e., m_i = 1, and a fixed number k of FUs are to be used. This special case will be
referred to as FUFD1(k), for which we have the following result.
THEOREM 17. For a fixed integer k, FUFD1(k) is NP-hard.
PROOF. Given a graph G(V, E) for GND, an instance of FUFD1(k) will be constructed as
follows. For each vertex v_i ∈ V, let O_i be a type of operation in the schedule. For each clique of
size l, l < k, construct a time step where the operations scheduled are as follows: o_{i_1}, o_{i_2}, ..., o_{i_l},
where o_j is an operation of type O_j.
This creates an instance of FUFD1(k) in polynomial time when k is fixed. The graph will
be k-colourable if and only if an assignment of the operations to the k FUs exists, such that no
operation is implemented by more than a single FU. ∎
We need not be restricted to designing FUs, where the usage of individual operations does 
not violate the initial allocation. Our main concern is to get FUs of minimum cost so that the 
schedule that has been obtained can be realized. However, as the decision problem of designing 
the FUs with a fixed number of FUs is hard, as shown in Theorem 17, the aforesaid minimization 
problem is also hard. 
Theorem 17 establishes the NP-hardness for this problem only when three or more FUs are to 
be used. If only one FU is to be used, then the problem is trivial. We shall now examine the 
problem when only two FUs are to be used. 
When only two FUs are to be used, it follows that at most two operations may be scheduled in 
any time step. Suppose that a maximum of one operation of any kind is present in a time step. 
It is now possible to find out in polynomial time whether the required assignment of operations 
will exist or not. This is because the problem of finding the bipartite partition of a graph, if one 
exists, can be done in polynomial time. However, for our purpose, we need the actual assignment 
of operations to the FUs so that the cost of the units is minimized. We now show that this
problem is NP-hard. 
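The polynomial-time test for the two-FU decision problem can be sketched as follows. A conflict graph is built with one vertex per operation type and an edge between any two types scheduled in the same time step, and the graph is then two-coloured by breadth-first search. The function names and the string encoding of operations are assumptions of this sketch.

```python
from collections import deque
from itertools import combinations
from typing import Dict, List, Optional, Set

def conflict_graph(schedule: List[List[str]]) -> Dict[str, Set[str]]:
    """One vertex per operation type; an edge joins types sharing a time step."""
    graph: Dict[str, Set[str]] = {}
    for ops in schedule:
        for op in ops:
            graph.setdefault(op, set())
        for u, v in combinations(ops, 2):
            graph[u].add(v)
            graph[v].add(u)
    return graph

def two_fu_assignment(schedule: List[List[str]]) -> Optional[Dict[str, int]]:
    """Map each operation type to FU 0 or FU 1, or return None if impossible."""
    graph = conflict_graph(schedule)
    fu: Dict[str, int] = {}
    for root in graph:
        if root in fu:
            continue
        fu[root] = 0
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in fu:
                    fu[v] = 1 - fu[u]   # neighbours go to the other FU
                    queue.append(v)
                elif fu[v] == fu[u]:
                    return None         # odd cycle: the graph is not bipartite
    return fu

additions_and_subtractions = two_fu_assignment([["+"], ["+", "-"], ["+", "-"]])
example_4 = two_fu_assignment([["+", "-"], ["+", "&"], ["&", "-"]])
```

A schedule containing only additions and subtractions receives an assignment with one operator per FU, whereas the schedule of Example 4 forms a triangle in the conflict graph and is rejected.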
THEOREM 18. The problem of determining the actual assignment of single cycle operations to 
FUs when exactly two FUs are to be used, so as to minimize the cost of the FUs is NP-hard. 
PROOF. Consider the special case when exactly one operation of each type is present and an 
operator for each operation has the same cost. Construct a conflict graph G as explained in the 
proof of Theorem 17. It would now be necessary to colour the graph with just two colours.
If the colouring succeeds, then an assignment exists that does not violate the default allocation.
However, if the colouring fails, then it becomes necessary to decide which of these operations should be
assigned to both the FUs. In order to minimize the cost of the formation of the FUs, we would
like to assign as few of these operations as possible to both the FUs. This corresponds exactly
to performing minimum node deletion on the graph G. The operations corresponding to the
nodes that have been deleted would have to be assigned to both the FUs. It is well known that
minimum node deletion is NP-complete and this proves the theorem. ∎
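The correspondence with minimum node deletion can be illustrated by an exhaustive search, which is exponential in the number of operation types and is meant only as a sketch for very small instances. The function names are hypothetical, and the graph is the conflict graph of Example 4 written out by hand.

```python
from itertools import combinations
from typing import Dict, Set

def is_bipartite(graph: Dict[str, Set[str]], removed: Set[str]) -> bool:
    """Two-colour the graph that remains after deleting the nodes in `removed`."""
    colour: Dict[str, int] = {}
    for root in graph:
        if root in removed or root in colour:
            continue
        colour[root] = 0
        stack = [root]
        while stack:
            u = stack.pop()
            for v in graph[u]:
                if v in removed:
                    continue
                if v not in colour:
                    colour[v] = 1 - colour[u]
                    stack.append(v)
                elif colour[v] == colour[u]:
                    return False
    return True

def min_node_deletion(graph: Dict[str, Set[str]]) -> Set[str]:
    """Smallest set of nodes whose deletion leaves a bipartite graph (brute force)."""
    nodes = sorted(graph)
    for size in range(len(nodes) + 1):
        for removed in combinations(nodes, size):
            if is_bipartite(graph, set(removed)):
                return set(removed)
    raise AssertionError("unreachable: deleting every node leaves an empty graph")

# Conflict graph of Example 4: every pair of operation types shares a time step.
triangle = {"+": {"-", "&"}, "-": {"+", "&"}, "&": {"+", "-"}}
shared = min_node_deletion(triangle)
```

For the triangle exactly one node has to be deleted, i.e., one operation type must be implemented on both FUs. Any single type is an equally valid choice, including the addition that is duplicated in Example 4.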
THEOREM 19. The problem of determining the assignment of operations to a fixed number 
k, k > 1 of FUs, so as to minimize the cost of the FUs is NP-hard. 
PROOF. It is easy to show that this problem is in NP. The NP-hardness of this problem follows 
from Theorems 17 and 18. ∎
8. CONCLUSIONS 
In this paper, we have examined the complexity of several synthesis tasks that are encountered 
during allocation and binding. The problems that we have considered are the port assignment 
problem for triple port memories, the register-interconnect optimization problem, and the problem
of formation of functional units. We have shown that all these problems in general are
NP-complete. 
The port assignment problem has recently acquired significance because triple port memories 
are now being used on-chip. Moreover, we have used the complexity results of PA to derive the 
complexity results for RIO. 
We have also examined the complexity of the problem of minimizing the cost of functional 
units, when it is necessary to make a design where the total number of FUs is specified
beforehand. We have shown that this problem is also NP-hard.
We have examined several versions of the RIO problem. The simplest case that we have con- 
sidered is RO for straight line code which is solvable in polynomial time. The next more general 
case that we have considered is RIO for straight-line code, which we have shown to be NP-hard. 
All the results of SRIO carry over to RIO because SRIO is a special case of RIO. Some results on 
connectivity binding were already available [2]. Our results have been derived using a different 
approach which produces newer insights into the hardness of the interconnect optimization problem.
The most significant contributions of our work on the complexity of allocation and binding
are the hardness results of the relative approximation of several subproblems in this area. We
have shown that the constant bounded relative approximation of PA for triple port memories
and SRIO are both NP-hard. These hardness results suggest that the use of deterministic
approaches to very global formulations of allocation and binding may not prove fruitful, in general.
Nondeterministic optimization techniques such as simulated annealing [11] and genetic algorithms [12] are
increasingly being used in DPS.
REFERENCES 
1. C.J. Tseng and D.P. Siewiorek, Automated synthesis of data paths in digital systems, IEEE Transactions on 
Computer Aided Design CAD-5, (July 1986). 
2. B.M. Pangrle, On the complexity of connectivity binding, IEEE Transactions on Computer Aided Design 
10, 1460-1465, (November 1991). 
3. C.A. Mandal, P.P. Chakrabarti and S. Ghose, Port assignment for dual and triple port memories using a 
genetic approach, In Proceedings of The Third Asia Pacific Conference on Hardware Description Languages 
(APCHDL) 96, pp. 60-64, (1996). 
4. F.J. Kurdahi and A.C. Parker, Real: A program for register allocation, In Proceedings of the 24th Design
Automation Conference, (1987).
5. E. Horowitz and S. Sahni, Computer Algorithms, Galgotia Press, New Delhi, India, (1988). 
6. T.C. Wilson, O.K. Banerjee, A. Basu, J.C. Majithia and A.K. Majumdar, Port assignment in multiport
memories for interconnection minimization in data path synthesis, In Proceedings of IFIP Working Conference
on Logic and Architecture Synthesis, Paris, (May 1990).
7. A.V. Aho, R. Sethi and J.D. Ullman, Compilers: Principles, Techniques and Tools, Addison-Wesley, (June
1987).
8. M.W. Krentel, The complexity of optimization problems, Journal of Computer and System Sciences 36, 
490-509, (1988). 
9. D.S. Johnson, Worst case behaviour of graph colouring algorithms, In Proceedings of 5th South Eastern
Conference on Combinatorics, Graph Theory & Computing, (1974).
10. C. Lund and M. Yannakakis, On the hardness of approximating minimization problems, In Proceedings of the 25th
Annual ACM Symposium on the Theory of Computing, (1993).
11. A.A. Duncan and D.C. Hendry, Area efficient DSP synthesis, In Proceedings of the 1995 European Design
Automation Conference, pp. 130-135, (September 1995). 
12. C.A. Mandal, P.P. Chakrabarti and S. Ghose, Allocation and binding for data path synthesis using a genetic 
approach, In Proceedings of VLSI Design '96, pp. 95-99, (1996).
