Throughput-driven floorplanning with wire pipelining. by CASU M.R. & MACCHIARULO L.
05 August 2020
POLITECNICO DI TORINO
Repository ISTITUZIONALE
Throughput-driven floorplanning with wire pipelining. / CASU M.R.; MACCHIARULO L.. - In: IEEE TRANSACTIONS ON
COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS. - ISSN 0278-0070. - STAMPA. - 24(2005),
pp. 663-675.
DOI:10.1109/TCAD.2005.846371
Availability:
This version is available at: 11583/1399095 since:
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 24, NO. 5, MAY 2005 663
Throughput-Driven Floorplanning With
Wire Pipelining
Mario R. Casu, Member, IEEE, and Luca Macchiarulo
Abstract—The size of future high-performance SoC is such that
the time-of-flight of wires connecting distant pins in the layout can
be much higher than the clock period. In order to keep the fre-
quency as high as possible, the wires may be pipelined. However,
the insertion of flip-flops may alter the throughput of the system
due to the presence of loops in the logic netlist. In this paper, we
address the problem of floorplanning a large design where long in-
terconnects are pipelined by inserting the throughput in the cost
function of a tool based on simulated annealing. The results ob-
tained on a series of benchmarks are then validated using a simple
router that breaks long interconnects by suitably placing flip-flops
along the wires.
Index Terms—Floorplanning, systems-on-chip (SoC), through-
put, wire pipelining.
I. INTRODUCTION
Current design of large integrated circuits already constitutes a formidable problem due to the complexity and
interrelations involved; technology scaling, while representing
the source of new possibilities of innovation and integration,
at the same time forces designers to come to terms with novel
issues. A particularly difficult problem which will soon become
intolerable if not addressed with clear methodologies and
design tools is the increasing relative timing cost of on-chip
communications. It is projected that in a few years the time
of flight of signals will make it possible to reach only a small
fraction of the entire die in a clock cycle, thus possibly slowing
the race toward higher clock frequencies that is indicated in the
International Technology Roadmap for Semiconductors [1].
Furthermore, the very complexity that calls for such a dense
arrangement of computational power on a single chip forces
designers to embrace design paradigms that decrease the cost of
redesign by reusing predesigned blocks (intellectual properties)
whose input/output functionality is guaranteed, and whose
modifications are kept to a minimum. These paradigms allow
much faster time-to-market at the cost of limited flexibility.
In such cases, incorporating latency issues in conventional
design flow (for example, by performing full-chip retiming)
might be impossible, due to either the sheer complexity of
the flattened netlist or its actual unavailability. However, reducing
IC frequency is not an option, especially for high-speed
designs. This has brought many researchers to try a middle
ground between classical synchronous design and more or less
elaborated asynchronous paradigms that can guarantee functionality
by abandoning the all chip synchronous hypothesis.
Manuscript received June 28, 2004; revised September 19, 2004. This work
was supported by the Center of Excellence for the Multimedia Radio Communications
(CERCOM), Torino, Italy. This paper was recommended by Guest
Editor P. Groeneveld.
M. R. Casu is with the Politecnico di Torino and CERCOM, I-10129 Torino,
Italy (e-mail: mario.casu@polito.it).
L. Macchiarulo was with the Politecnico di Torino and CERCOM, I-10129
Torino, Italy. He is now with the Department of Electrical Engineering, University
of Hawaii, Honolulu, HI 96822 USA (e-mail: lucam@hawaii.edu).
Digital Object Identifier 10.1109/TCAD.2005.846371
The introduction of logic that controls the communication be-
tween blocks in a synchronous fashion is a solution that might
represent a good compromise between a theoretically appealing
but practically expensive asynchronous design and a more
traditional approach. In addition, this comes with the advantage
of being perfectly transparent to the designer and guaranteeing
both timing constraints and correct functional behavior with
little overhead. Such design styles allow the adoption of any
possible clock frequency compatible with the specifications
of the interconnecting blocks by intelligently pipelining the
interconnects. However, the existence of computational loops
forces a reduction in throughput that, if not properly controlled,
could jeopardize all the advantage gained by the cycle time
decrease. For this reason, and given that the loop constraints
are derived by the physical placement of interacting blocks, a
strategy that addresses the problem at the physical design stage
is necessary.
In this paper, based on [17], we present a floorplanning and
routing strategy with interconnect pipelining that takes into ac-
count the throughput problem. Furthermore, we highlight the
extent to which these issues influence the full exploitation of a
latency-aware methodology. The paper is organized as follows.
After an analysis of previous work on the topic, in Section III
we establish a few general results on the throughput of inter-
connected systems, and give a description of an algorithm that
enables us to compute the maximum throughput of an intercon-
nection of blocks. A heuristic function that approximates the
throughput cost is described in Section IV. Section V describes
the theory and practice of a flip-flop routing algorithm that is
suited as a back end for the physical design of such systems.
Various experimental evaluations are presented in Section VI,
which is followed by the conclusion.
II. RELATED WORK
The problem of how to insert pipeline stages in long inter-
connects has been recently addressed in its various aspects. At
the system level, Carloni and Sangiovanni-Vincentelli report
a methodology they call “latency insensitive” [2], [4], [5] that
allows the preservation of the functionality of a system that
is assumed working under zero wire delay constraint when
an arbitrary amount of wire latency is added. The modified
system works by means of a latency insensitive protocol that
requires the addition of a couple of bits, valid and stop, to each
of the communication channels between the various blocks of
the system. While the overhead of the additional logic for the
protocol management is negligible, the routing overhead may
be relevant. We recently demonstrated in [6] another approach
to latency insensitive design that preserves the performance of
the previous approach but does not require the routing of the
protocol signals. We take these papers as a reference showing
how systems can be made to work with wire pipelining without
RTL modifications of the netlist blocks in a way that preserves
functionality. However, the number of valid operations per
cycle, that is, the system’s throughput, may be affected, as
explained in the following section.
Concerning the physical synthesis aspects, Cong and Lim [7],
and more recently Lu and Koh [8] and Chu et al. [9], formulate
the problem of wire pipelining as one of retiming, therefore as-
suming that latches or flip-flops can be moved from logic blocks
to interconnects. In a recent paper, Tong and Young [10] pro-
posed a method for register placement along global wires, given
a retiming solution. However, as already noticed in the introduc-
tion, while optimal, retiming can only be applied if logic blocks
are described at least at RT level and either the logic description
or the post-synthesis netlist is available. That is of course not
possible for hard-IPs. As for soft-IPs, while theoretically pos-
sible, the manual rework to account for pipelined global wires
at RT level might severely impact the design costs. We address
here the case of intellectual properties that designers are likely
to use in a plug-and-play fashion. The combination of retiming
and interconnect latency insensitive design may lead to further
improvements [11].
Other recent papers show how routing and buffer insertion
techniques, all derived in the end from the classic paper by Van
Ginneken [12], can be adapted to routing path construction and
simultaneous flip-flop and repeater insertion [13]–[16]. In [13]
and [14], only pin-to-pin connections are considered, while this
paper also considers the case of multipin connections, like in
[15]. None of these papers, however, addresses floorplanning is-
sues or how netlist connections translate into final performance
achievable from the buffer-flip-flop routing after a given place-
ment. As far as we know, these issues were addressed for the first
time in our preliminary work [17], of which this paper is an ex-
tension with new improvements. More specifically, with respect
to the original paper, we give here a proof of the properties used
for deriving the exact and approximate throughput estimation al-
gorithms, a new flip-flop routing strategy with better throughput
savings, comparisons with other possible cost functions, and de-
tailed data rate considerations for frequency variations.
III. THROUGHPUT EVALUATION
In order to be able to assess the performance of a design in
the presence of added latency, it is necessary to consider a struc-
tural constraint. This kind of constraint has emerged in various
disguises in the history of high throughput design (for example,
see the iteration bound of Messerschmitt et al. [18]), but it con-
sists in a relatively simple hard limit: Cycles introduce func-
tional dependency in which overall latency is of the essence.
Fig. 1. Throughput limitation example.
We can follow the problem in the example of Fig. 1. In Fig. 1(a),
two blocks are shown to be connected by a bidirectional link in
such a way that, in every clock cycle the output of A is needed
by B and vice versa. This is as tight a communication constraint
as can exist but, as long as both connections are supposed to
take no time, no performance bottleneck arises. If, on the other
hand, we suppose that, for any reason one of the two connec-
tions is physically so long that it needs a pipeline in order to
guarantee the maximum frequency constraint, it is clear that the
system made up of the two blocks (and by extension any larger
system incorporating it) will no longer be able to work at the
maximum throughput. For, if at time 0 both A and B issue
valid data $a_0$ and $b_0$, at the next clock tick (time 1), while block B
is able to produce the next valid datum for A, having processed
$a_0$, block A will not operate on any valid datum, given that its
next legal input was delayed by one clock cycle by the presence
of the pipelining element. Strictly speaking, this implies
that the system as it is will not work: At time 2 [Fig. 1(c)], block
A will have performed a transition based on nonvalid data [la-
belled “?” in Fig. 1(b)] and will produce invalid data that will
propagate through the system. This small example shows that
simply adding a pipelining element modifies the functionality
of the system (mostly unpredictably). To preserve it, we are left
with two alternatives: to move existing memory elements from
logic to wires or to control the synchronicity by selectively con-
trolling the elements’ clocks. The first alternative borders with
classical retiming approaches, and it might be seen as the op-
timal solution: When in need of a memory element to segment
a long connection, we reclaim one or more flip-flops from one
or both of the connected blocks, thus changing at the same time
the connectivity and the block in order to guarantee the same
functionality.
If, for the reasons described in the introduction, we rule out
such an alternative, the remaining solution is that of control-
ling the single block in such a way that it reacts only to valid
signals. This requires an overall clock scheduling that will acti-
vate the various units in a coordinated way. A discussion of how
this scheduling is enforced and whether a global control (as pro-
posed by [22]) or a distributed approach [5], [6] is preferred is
outside the scope of this paper; here, it is sufficient to say that
such a control is possible. With this in mind, let us follow how
such a strategy could be employed to “legalize” the situation of
Fig. 1. At time 1, block A should be prevented from operating to
avoid logical errors: let us suppose then its clock is gated. Block
B will be able to operate normally, and therefore will produce
new output $b_1$ (dependent on its state and on input $a_0$, which it
can read). At time 2, A will finally be able to compute its next
output $a_1$ (based on its valid input $b_0$); at the same time, B
has to be stopped, as the datum it needs from A (that is, $a_1$) is
just about to be computed, and therefore not ready yet.
Fig. 2. Combined loop.
From
this evolution it is clear that, even though the whole system pro-
ceeds in a synchronous fashion and evolves on the basis of the
clock ticks, any single signal is stalled from time to time. In
this example, both A and B’s output are stalled once every three
clock cycles, though not at the same time. It is possible to prove
not only that this result is possible (i.e., there exists a schedule
that will allow a throughput of 2/3) but also that this is the best
possible result for that cycle: The presence of a new pipelining
element in the cycle has the consequence of adding one clock
delay to any computation that starts at a block and ends at the
same block. A computation that would have taken two clock cycles
(the number of elements in the loop) to complete will now
take $2 + 1$ clock cycles, thus degrading the average performance
to 2/3.
Interaction between loops is such that the most critical one
dictates the overall performance. An example is shown in
Fig. 2: Even though the left loop might proceed at an average
throughput of 1/2 as a self-standing subsystem, the presence
of a common edge with the smaller maximal throughput of
2/5 forces it to slow down accordingly. It can be shown that
this is always the case, i.e., the loop with the smallest ratio
$m/(m+n)$ (with $m$ blocks and $n$ flip-flops on the loop) always
controls the upper bound of the throughput,
in each strongly connected component of the graph.
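As a small numerical illustration of the statement above, the following sketch computes the bound for a component from a list of loops, each described by its number of blocks m and flip-flops n. The (m, n) pairs are hypothetical values chosen only so that the two loops have the ratios 1/2 and 2/5 quoted in the text; they are not read from Fig. 2.

```python
from fractions import Fraction


def loop_throughput(blocks: int, flip_flops: int) -> Fraction:
    """Best average throughput of one loop with m blocks and n flip-flops: m / (m + n)."""
    return Fraction(blocks, blocks + flip_flops)


def component_throughput(loops):
    """Upper bound for a strongly connected component: the most critical loop dominates."""
    return min(loop_throughput(m, n) for m, n in loops)


# Hypothetical (m, n) pairs, picked only to reproduce the ratios 1/2 and 2/5.
left_loop = (2, 2)
right_loop = (2, 3)
bound = component_throughput([left_loop, right_loop])
print(bound)  # 2/5
```

The faster loop cannot run at 1/2 on its own, because the shared edge is paced by the slower loop.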
Before drawing a general conclusion, we need to define
schedules formally and see how a block’s schedule can influ-
ence neighboring blocks or pipelining registers. Let us define:
Definition: A schedule function for a block is a function $s(t)$
mapping natural numbers into {0, 1} such that $s(t) = 1$
implies that the block is active at clock tick $t$, and vice versa.
Definition: The time function of a block is a function $\tau(k)$
mapping natural numbers into natural numbers such that $\tau(k) = t$
implies that the $k$th valid emission of the block is scheduled at
clock tick $t$.
Schedule and time functions are alternative ways of speci-
fying the schedule of a block (time function being a sort of inver-
sion of the cumulative schedule). The time function for a block
can be easily derived from its schedule function, as shown for
the case of Fig. 3: We simply have to report as $\tau(k)$ what is reported
as clock tick count under the $k$th occurrence of a 1 in the
schedule.
Fig. 3. Example of schedule and time functions.
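The conversion between the two representations can be sketched in a few lines. The code below assumes that clock ticks and emissions are both numbered from 0 (the convention of Fig. 3 is not visible here), and the example schedule is a hypothetical one that is active two ticks out of three.

```python
def time_function(schedule):
    """Return tau, where tau[k] is the clock tick of the k-th valid emission."""
    return [tick for tick, active in enumerate(schedule) if active == 1]


def schedule_function(tau, num_ticks):
    """Inverse conversion: rebuild the 0/1 schedule from the emission ticks."""
    active_ticks = set(tau)
    return [1 if tick in active_ticks else 0 for tick in range(num_ticks)]


# Hypothetical schedule: the block is gated at ticks 1, 4, and 7.
schedule = [1, 0, 1, 1, 0, 1, 1, 0]
tau = time_function(schedule)
print(tau)  # [0, 2, 3, 5, 6]
```

The schedule function is the form a clock-gating controller would consume, while the list of emission ticks is the form that is convenient for reasoning about connected elements.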
Schedule function is a simpler way to describe what the
single block is doing from the point of view of clock gating
(and is therefore the easiest representation for implementation
purposes). However, the time function representation can be
used to derive interesting properties of schedules between
connected blocks.
Lemma: If an input of a pipelining element $P$ (be it another
pipelining element or a block) has a time function $\tau_{in}(k)$, the
time function $\tau_P(k) = \tau_{in}(k) + 1$ is a valid schedule for the
element $P$.
Proof: In order for the register to be able to emit the $k$th
valid signal, it needs to memorize it at least in the previous clock
tick, so it must receive it 1 clock cycle before.
Lemma: If an input of a block $B$ has a time function $\tau_{in}(k)$,
the time function $\tau_B(k+1) = \tau_{in}(k) + 1$ is a valid schedule for
the element $B$.
Proof: In this case, the block, after operating on the $k$th
signal, will be able to emit its $(k+1)$th output in the following
clock cycle. Note that the lemma makes the assumption that the
sequential element behaves as a Moore machine. This assumption
is held throughout the remainder of the paper.
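The two lemmas can be exercised on a simple loop by treating them as recurrences. The sketch below is only an illustration under stated assumptions: every element fires as early as its input allows, every block holds a valid initial datum at tick 0, an element emits at most once per tick, and the indexing convention is that a Moore-type block emits datum k+1 one tick after its input emitted datum k. The function name and the string encoding of the loop are hypothetical.

```python
from fractions import Fraction


def ring_emission_times(ring, num_emissions):
    """Earliest emission ticks for a simple loop.

    ring: string of 'B' (block) and 'P' (pipelining flip-flop) in signal order.
    Returns (rotated_ring, times) with times[i][k] = tick of the k-th emission
    of element i of the rotated ring.
    """
    if "B" not in ring:
        raise ValueError("the loop must contain at least one block")
    start = ring.index("B")
    ring = ring[start:] + ring[:start]  # rotate so that element 0 is a block
    size = len(ring)
    times = [[0] * num_emissions for _ in range(size)]
    for k in range(num_emissions):
        for i, kind in enumerate(ring):
            pred = (i - 1) % size
            if kind == "B":
                # Block: initial datum at tick 0, then one tick after input k-1.
                earliest = 0 if k == 0 else times[pred][k - 1] + 1
            else:
                # Flip-flop: one tick after its input emits datum k.
                earliest = times[pred][k] + 1
            if k > 0:
                # Assumption: an element emits at most once per clock tick.
                earliest = max(earliest, times[i][k - 1] + 1)
            times[i][k] = earliest
    return ring, times


# Two blocks and one flip-flop, as in the example of Fig. 1.
ring, times = ring_emission_times("BBP", 7)
a_times = times[0]
print(a_times)  # [0, 2, 3, 5, 6, 8, 9]
print(Fraction(len(a_times) - 1, a_times[-1] - a_times[0]))  # 2/3
```

Each block emits twice every three ticks, which matches the 2/3 throughput derived for the two-block example.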
Theorem 1: A strongly connected component of a block
netlist always admits a schedule which allows an average
throughput of $m/(m+n)$, where $m$ and $n$ are, respectively,
the number of blocks and the number of flip-flops on the loop
that makes the expression minimal, and no schedule exists that
allows a better average throughput.1
(Partial) Proof: The proof is articulated in two main parts:
we first show that simple cycles satisfy the theorem, and then,
that any strongly connected topology can support a schedule that
satisfies it.
Simple Loops: Given a loop with $m$ blocks and $n$ wire
pipelining elements [as exemplified by Fig. 4(a)], it is possible
to use any schedule that contains $m$ active slots out of $m+n$
slots in a period. To show this, let us consider any such schedule
$\tau_B$ and arbitrarily assign it to a block $B$. Then, using properties
expressed by lemmas we can derive acceptable schedules for
all the following elements, until we find another valid schedule
for $B$, which can be expressed (due to the associativity and
commutativity of the transformations described in the lemmas)
as $\tau'_B(k+m) = \tau_B(k) + m + n$. Now, the schedule $\tau_B$ has $m$
active slots out of $m+n$, so that $\tau_B(k+m) = \tau_B(k) + m + n$
and $\tau'_B(k+m)$ cannot exceed $\tau_B(k+m)$, thus confirming the
validity of the schedule. Any schedule with a smaller ratio of total
to active slots will necessarily generate a
$\tau'_B(k+m) > \tau_B(k+m)$, not satisfying the constraint: Therefore, no better
schedule exists.
Fig. 4. Basis of induction and inductive step.
1The complete proof of the following theorem is not reported because it is
outside the scope of the paper.
General Topology (sketch of proof): Take any loop of
the topology with the most critical (lowest) $m/(m+n)$ ratio. The
procedure outlined above will give a valid schedule for all the
nodes and edges inside such loop [the solid loop in Fig. 4(b)].
Use the schedule that will maximally distribute the number of
zeros in the schedule function (this is necessary to guarantee the
success of the extension, but in certain cases it can be avoided).
Then, extend the schedule to all other nodes by following the
same basic ideas. Whenever a node has more than one input,
only the one that leads to the latest arriving signal has to be
considered for schedule calculation. Each calculated node
needs to be propagated throughout its transitive fanout, until no
schedule adjustment is possible. The schedules of the critical
loops should not be touched. It can be shown that in no block
will the extension cause a new input to be scheduled before it
has to be actually processed, and that the process terminates (a
detailed proof requires techniques derived from cost-to-time
ratio problem theory, as they are applied in [6]).
It might be asked whether only loops can contribute to
throughput degradation. Reconvergent fanouts outside loops
with branches holding a different number of delays need resyn-
chronization, and therefore the fastest branch has to slow down
its source, in order not to lose this data. However, this case
substantially differs from that of loops in that throughput can
be increased by adding flip-flops, while loops can better their
performance only by deleting them. The constructive proof of
the previous theorem implies such a conclusion, wherein the
basic minimal buffering assumption is equivalent to the possi-
bility of adding the correct amount of pipelining elements for
each input that would need such an addition to equalize paths.
Although it is relatively easy to add elements (for example,
at the blocks’ terminals), removing them might be impossible
(removing a pipelining element will cause a timing violation, and the system
will not be able to work at the given frequency). It can be shown
that resynchronization can always be achieved by appropriate
scheduling and/or communication protocols (details omitted,
but see [6]), such that the system can operate at its theoretically
maximum throughput.
In conclusion, a block netlist’s best possible average
throughput is the minimum value of $m/(m+n)$, where $m$ and $n$
are the number of blocks and delay elements, respectively, of
a loop.
Even if the preceding proposition gives a complete theoret-
ical answer to the question of throughput evaluation, it does not
address the practical issue of how to evaluate real cases. In fact,
the very structure of the cost formula, in which some nodes in
the loop (blocks) increase its value, while others (delays) de-
crease it, makes it impossible to tell up front if the minimum
will be found in the longest loop, or in the shortest, or some-
where in between. It would seem that the only solution is to tra-
verse all possible loops, a hopeless task for all but the simplest
cases, given the exponential nature of loop enumeration.
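For very small netlists the brute-force traversal can still be written down directly, which is useful as a reference when checking a faster method. The sketch below enumerates every simple cycle by depth-first search and returns the smallest m/(m+n). The graph encoding (a `delays` dictionary mapping each block-to-block edge to its number of flip-flops) and the example netlist are hypothetical.

```python
from fractions import Fraction


def min_throughput_brute_force(nodes, delays):
    """Smallest m / (m + n) over all simple cycles.

    nodes:  list of block names.
    delays: dict {(u, v): number of flip-flops on edge u -> v}.
    """
    order = {v: i for i, v in enumerate(nodes)}
    fanout = {v: [] for v in nodes}
    for (u, v), n in delays.items():
        fanout[u].append((v, n))
    best = Fraction(1)  # no loop with flip-flops means full throughput

    def dfs(start, node, visited, m, n):
        nonlocal best
        for nxt, d in fanout[node]:
            if nxt == start:
                # Loop closed: m blocks, n + d flip-flops.
                best = min(best, Fraction(m, m + n + d))
            elif order[nxt] > order[start] and nxt not in visited:
                # Only visit later nodes so each cycle is reported once.
                dfs(start, nxt, visited | {nxt}, m + 1, n + d)

    for s in nodes:
        dfs(s, s, {s}, 1, 0)
    return best


# Hypothetical three-block netlist with two loops: A-B (2/3) and A-B-C (3/6).
example_delays = {("A", "B"): 0, ("B", "A"): 1, ("B", "C"): 1, ("C", "A"): 2}
print(min_throughput_brute_force(["A", "B", "C"], example_delays))  # 1/2
```

The number of simple cycles grows exponentially with the netlist size, so this form is only practical as a cross-check.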
However, computation can be simplified. To see how, let us
define a couple of terms.
Definition: The length of a path is the number of its blocks.
Definition: The partial cost of a path is $n/(m+n)$, where
$m$ and $n$ are, respectively, the blocks and delays on the path.
Then, the key observation is expressed by the following
Theorem 2: Given two paths with common start and end
nodes, if they have the same length, and if the partial cost of the
first is greater than the partial cost of the second, it can never be
the case that the second is part of a loop whose cost is higher
than the first.
Proof: The two paths have the same length, meaning
that $m_1$, the number of blocks of the first path, and $m_2$, the
number of blocks of the second path, are equal. The partial
cost is $n_1/(m_1+n_1)$ for the first and $n_2/(m_2+n_2)$ for the
second. The hypothesis on the partial costs is equivalent to
$n_1/(m_1+n_1) > n_2/(m_2+n_2)$. Therefore, $n_1/(m+n_1) > n_2/(m+n_2)$,
due to the equality of the $m$s. In conclusion, $n_1 > n_2$. Now, the cost of
any loop is a monotonic function of the ratio $n/m$, as can be
seen by rearranging the fractions. For any possible path closing
the loop between start and end, with $m_c$ blocks and $n_c$ delays,
we have costs that are monotonic
with $(n_1+n_c)/(m_1+m_c)$ and $(n_2+n_c)/(m_2+m_c)$,
respectively. Therefore, as $m_1 = m_2$, the two fractions have
the same denominator, and the first one is always greater than
the second one.
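The rearrangement invoked in the proof can be written out explicitly. The lines below are a sketch that assumes the loop cost is $n/(m+n)$, i.e., one minus the throughput, and uses $m_c$ and $n_c$ (names introduced here for illustration) for the blocks and delays of the path that closes the loop.

```latex
% Loop cost as a function of x = n/m, and its monotonicity:
\[
  \frac{n}{m+n} \;=\; 1 - \frac{m}{m+n} \;=\; \frac{n/m}{1 + n/m},
  \qquad
  \frac{d}{dx}\,\frac{x}{1+x} \;=\; \frac{1}{(1+x)^{2}} \;>\; 0 .
\]
% Closing both paths with the same segment, with m_1 = m_2 = m and n_1 > n_2:
\[
  \frac{n_1 + n_c}{m + m_c} \;>\; \frac{n_2 + n_c}{m + m_c}
  \quad\Longrightarrow\quad
  \frac{n_1 + n_c}{(m + m_c) + (n_1 + n_c)} \;>\;
  \frac{n_2 + n_c}{(m + m_c) + (n_2 + n_c)} .
\]
```

Since $x/(1+x)$ is increasing, ordering the ratios $n/m$ of the two closed loops orders their costs in the same way, so the path with fewer delays can be discarded.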
In practical terms, this means that, while traversing the netlist
in search of the maximum cost $n/(m+n)$ (which is to say, one
minus the minimum throughput $m/(m+n)$), it is not
necessary to record each path through which we visit a single
node, but at most as many as the possible path lengths (limited
by the overall number of nodes). The pseudocode that follows
illustrates the basic structure of the algorithm.
Function computeExactCost({V_in, E_in})
Input: GRAPH = {V_in, E_in}
Output: COST
exactCost({V_in, E_in});
forEach v ∈ V do
    listPartial(v) = {(0, 0)};
    forEach vElse ∈ V \ {v} do
        listPartial(vElse) = ∅;
    end
    Next = {v};
    while ∃ vc ∈ Next do
        forEach vo ∈ Fanout(vc) do
            forEach (mv, nv) ∈ listPartial(vc) do
                tempCost = (mv + 1, nv + intDist(vc, vo));
                if !∃ (mv + 1, nold) ∈ listPartial(vo) : nold > nv + intDist(vc, vo) then
                    if vo == v then
                        updateWorstCost(tempCost);
                    end
                    listPartial(vo) = listPartial(vo) ∪ {tempCost};
                    Next = Next ∪ {vo};
                end
            end
        end
        Eliminate vc from Next;
    end
end
return WorstCost
listPartial is a list of partial costs that accumulates on each
node visited in the breadth-first search, and is limited by the total
number of nodes due to the previous theorem. intDist computes
the number of register elements needed to connect the two nodes
in order to evaluate the relevant parameter of the cost function.
Finally, updateWorstCost updates the value of current worst
loop found.
The search is repeated for each starting node, but all nodes al-
ready used are marked (not shown in the pseudocode) so that no
other path through them is considered in subsequent searches
(all loops through them are already counted). Therefore, the
maximum loop that passes through a given node can be assessed
in a time bounded by the number of edges $|E|$ (each edge
is traversed only once for each possible partial length) multiplied
by the number of nodes $|V|$ (maximum number of partial
costs recorded), and the overall problem has a complexity of
$O(|V|^2 |E|)$.
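A compact way to experiment with this search is sketched below. It is not a line-by-line transcription of the pseudocode: it is a layered variant that advances all partial paths one block at a time and keeps, for each node and path length, only the entry with the most delays. The `delays` dictionary stands in for intDist, the node ordering stands in for the marking of already used start nodes, and the cost is taken to be n/(m+n); all of these are assumptions made for the example.

```python
from fractions import Fraction


def compute_exact_cost(nodes, delays):
    """Maximum loop cost n / (m + n).

    nodes:  list of block names (the list order doubles as the 'used' marking).
    delays: dict {(u, v): number of flip-flops on edge u -> v}.
    """
    order = {v: i for i, v in enumerate(nodes)}
    fanout = {v: [] for v in nodes}
    for (u, v), d in delays.items():
        fanout[u].append((v, d))
    worst = Fraction(0)
    for start in nodes:
        # frontier[node] = largest delay count over paths of the current length
        frontier = {start: 0}
        max_length = len(nodes) - order[start]
        for m in range(1, max_length + 1):
            next_frontier = {}
            for vc, nv in frontier.items():
                for vo, d in fanout[vc]:
                    if order[vo] < order[start]:
                        continue  # loops through earlier starts are already counted
                    n = nv + d
                    if vo == start:
                        worst = max(worst, Fraction(n, m + n))  # loop closed
                    elif n > next_frontier.get(vo, -1):
                        next_frontier[vo] = n  # dominated entries are dropped
            frontier = next_frontier
    return worst


# Hypothetical three-block netlist: loop A-B costs 1/3, loop A-B-C costs 3/6.
example_delays = {("A", "B"): 0, ("B", "A"): 1, ("B", "C"): 1, ("C", "A"): 2}
cost = compute_exact_cost(["A", "B", "C"], example_delays)
print(cost, 1 - cost)  # 1/2 1/2
```

Each start node needs at most |V| layers and each layer touches every edge at most once, which is in line with the per-node bound discussed above.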
The algorithm was tested on classic MCNC (ami33, ami49,
apte, hp, xerox) and some GSRC benchmarks (n10, n30, n50,
n100) [29]. Average and maximum CPU times were 0.2 and
1.1 s, respectively. This is sufficient for performance evaluation,
but too slow to be included in a floorplan optimization loop.
For this reason, in the following we will describe the heuristic
alternatives we evaluated and the solution we pursued.
IV. FLOORPLANNING FOR THROUGHPUT
The high computational cost of the exact throughput was not
the only reason that brought us to look for a different cost func-
tion to integrate in a floorplanning environment. There are also
intrinsic reasons to look for a different way of expressing the
cost. To discuss the question, we focus on a floorplanning im-
plementation. In particular, we decided that the best choice of
a floorplanning strategy was that of a simulated annealing ap-
proach. The availability of efficient and compact representa-
tions for floorplans (slicing and not) containing all feasible solu-
tions with a minimal redundancy (corner block list [23], O-tree
[24], Sequence Pair [25], to name just a few) has made simu-
lated annealing the preferred method to implement optimiza-
tions on more diverse cost functions (such as array modules
[26], routability and buffer planning [27], congestion control).
Now, three conditions will decide whether or not a simulated
annealing optimization will succeed.
1) The representation should be compact, exhaustive, and
efficient.
2) The cost function must be easy to compute.
3) The cost function must be devoid of strong discontinu-
ities.
While the first condition is satisfied by practically any of the
aforementioned alternatives, both the second and the third make
a strong case against the exact evaluator previously described.
Simple modifications of the evaluator were tried, as detailed in
the results section, but did not give satisfactory performance. An
annealing process is able to settle on an acceptable solution only
if its temperature schedule is sufficiently slow, and this already
makes the evaluator impractical. Also, especially when we get
close to the end of the annealing process (precisely when small
variations on the cost will mean much for the final solution), the
exact function is not smooth at all, due to the high sensitivity
to small local changes: A minimal increase in a net’s length
could introduce a new delay in a loop, thus abruptly increasing
the throughput cost. So, even if throughput itself is our goal, it
might be a good idea to avoid including it explicitly in the cost
function.
We needed a function that follows closely the real throughput
trend (possibly monotonic with it) while being smoother and
faster to compute. In order to make it as smooth as possible,
we looked for a function of the whole floorplanned circuit
rather than a maximum value. It had to be computed on the
entire netlist and to depend strongly on the possible presence
of delay on critical loops. A suitable function can be computed
as follows.
1) Before entering the annealing iteration, we statically
evaluate a weight for each pin to pin net, the inverse of
the shortest loop the net belongs to.
2) At each iteration, we consider each pin-to-pin connection
and evaluate its Manhattan distance.
3) The distance is divided by the maximum length admis-
sible between clocked elements, and the integer part of
the result is taken.
4) This last number is multiplied by the weight computed
in the first point.
5) All such values are summed.
A pseudocode computeHeuristicCost for the pre-
ceding heuristic follows (shLoop represents the length of the
shortest loop containing the edge).
Function computeHeuristicCost({V_in, E_in})
Input: GRAPH = {V_in, E_in}
Output: COST
heuristicCost({V_in, E_in});
Cost = 0;
foreach b ∈ Block do
    foreach bo ∈ fanOut(b) do
        Cost += intDist(b, bo) / shLoop(b, bo);
    end
end
return Cost
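The heuristic can be sketched as runnable code as follows. This is a simplified illustration, not the tool's implementation: each block is reduced to a single coordinate instead of individual pin positions, the shortest-loop lengths are assumed to be precomputed in a dictionary `sh_loop`, and a net that lies on no loop is assumed to contribute nothing. The names `positions`, `nets`, `sh_loop`, and `max_length` are hypothetical.

```python
def int_dist(pin_a, pin_b, max_length):
    """Estimated flip-flops on a net: integer part of Manhattan distance / max_length."""
    manhattan = abs(pin_a[0] - pin_b[0]) + abs(pin_a[1] - pin_b[1])
    return int(manhattan // max_length)


def compute_heuristic_cost(positions, nets, sh_loop, max_length):
    """Sum over pin-to-pin nets of (estimated flip-flops) / (shortest loop length)."""
    cost = 0.0
    for src, dst in nets:
        loop_length = sh_loop.get((src, dst))
        if loop_length is None:
            continue  # assumption: a net on no loop does not affect throughput
        cost += int_dist(positions[src], positions[dst], max_length) / loop_length
    return cost


# Hypothetical placement: A and B form a loop of length 2, C hangs off B.
positions = {"A": (0, 0), "B": (5, 0), "C": (5, 7)}
nets = [("A", "B"), ("B", "A"), ("B", "C")]
sh_loop = {("A", "B"): 2, ("B", "A"): 2}
print(compute_heuristic_cost(positions, nets, sh_loop, max_length=3))  # 1.0
```

Only the distances change between annealing moves, so the per-iteration work is one pass over the nets.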
Fig. 5. Correlation between exact and heuristic throughput evaluation.
The rationale behind this function is simple. By dividing the
Manhattan distance between two pins by the maximum length,
we get an estimate of the minimum number of delays on that
edge (all multipin nets are implicitly decomposed into pin to
pin nets); each of these delays will count more or less according
to the loop of maximum cost the net belongs to.
As the exact evaluation of this cost is very computationally
expensive (it amounts to the algorithm of Section III), we
suppose that small loops will have a great impact on the cost,
and weigh every single delay so found according to the inverse
length of the loop. Summing over the entire set of nets smooths
the function sufficiently to make it suitable for an annealing
process. It is true that the cost function herein described will
rather optimize an average throughput than the best case,
but a similar objection can be made for timing-driven cost
functions. The rationale is that, on average, by reducing the
mean throughput cost of cycles, the worst case will be affected
in the right direction. The requirement for a smooth function
in the annealing process makes a strong point in favor of this
choice, and experimental results will show that the results are
actually consistent with the expected optimizations. We ran
some experiments in which the exact cost function was used
(on an example—n10—in which this was possible, considering
the computational cost). We observed that the introduction of
an exact but abrupt evaluation did not allow any optimization
whatsoever. We believe this is due to the difficulty that even
a technique as robust as simulated annealing has in directing
optimization with extremely peaked cost functions.
The shortest-loop computation is performed through a simple breadth-first
search starting from the destination of the edge at stake.
Moreover, since it is carried out once, outside the annealing loop, it
represents a small fraction of CPU time.
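A minimal sketch of this static step is given below. It assumes the netlist is available as a fanout dictionary (a hypothetical encoding) and measures the loop length as the number of blocks on the loop, so that the weight of a net is the inverse of the returned value.

```python
from collections import deque


def shortest_loop_length(fanout, src, dst):
    """Number of blocks on the shortest loop containing edge src -> dst, or None."""
    # Breadth-first search from the destination of the edge back to its source.
    dist = {dst: 0}
    queue = deque([dst])
    while queue:
        node = queue.popleft()
        if node == src:
            # Edges on the path dst .. src, plus the edge src -> dst itself.
            return dist[node] + 1
        for nxt in fanout.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None  # the edge is not part of any loop


# Hypothetical netlist: loops A-B and A-B-C; D only receives data.
fanout = {"A": ["B"], "B": ["A", "C"], "C": ["A", "D"], "D": []}
print(shortest_loop_length(fanout, "A", "B"))  # 2
print(shortest_loop_length(fanout, "B", "C"))  # 3
print(shortest_loop_length(fanout, "C", "D"))  # None
```

Running this once per edge before annealing starts yields the table of weights that the cost function then reuses at every move.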
Our experiments showed a pretty high correlation between
this cost and the exact throughput. Fig. 5 reports the 1-comple-
ment of the exact throughput versus the heuristic cost for a floor-
plan of GSRC benchmark n100 [29] with many different values
of the maximal distance. Similar behavior has been observed in
the other benchmarks, thus confirming the good quality of the
heuristic herein described. The connected line shows the case
of a specific floorplan with varying length constraints, while
the other points show results from different optimized floor-
plans. It is apparent that the two costs are closely related. Even
though the measure cannot be used for an absolute evaluation
of a system’s throughput, it can certainly be used inside an optimization
engine because heuristic and exact cost are monotonically
correlated. It is interesting to observe that the heuristic
function presents a smoother variation than the exact one, as we
wanted in order to include it in the evaluation loop of the an-
nealing process.
The curve appearance could raise the question of whether a
different function could give rise to a more direct relation be-
tween heuristic and exact throughput, possibly benefiting the
optimization process. For this reason, we tried to use a function
of the single loops’ behavior that could give rise to a more
linear trend, but all experiments yielded similar or worse
optimization results (not shown for space constraints). However,
we do not exclude that a more elaborate function could result in
a better match between the heuristic and the real throughput and
at the same time guarantee a better optimization.
On the other hand, wirelength and throughput are very loosely
correlated, if at all (see Section VI for examples).
We conclude this section with two observations.
1) Even though the cost function is based on the nets’
lengths, it is not too strictly correlated with the overall
wirelength, due to the integer division step. We ob-
served experimentally that its main effect is that of
taking into account only critical nets which are nor-
mally a small fraction of the total. Also, even critical
nets that are not included in short loops (for example,
nets from and to the terminals) do not contribute to the
cost.
2) The computational cost of the evaluation is very small,
because the most expensive work is done statically out-
side the loop.
V. FLIP-FLOP ROUTING
The number of flip-flops evaluated within the floorplan
framework is a simple function of the Manhattan distance
between two pins. Every point-to-multipoint (P2MP) net is
decomposed into point-to-point (P2P) connections. As a result,
the estimated number of flip-flops in a path connecting two pins
of a P2MP net can differ from the number obtained after routing,
since a router aims at minimizing the wirelength, and hence the
number of flip-flops, of each net. In general, the throughput is
maximized when the overall number of flip-flops in the paths forming
the loop is minimum, while a router normally deals with one
net at a time. Reducing the wirelength of each net can therefore be a
conflicting objective and may lead to further throughput degradation.
In order to validate the results of the throughput-based
floorplanning, we built a simple router that places flip-flops
along the interconnects such that given timing specifications are
respected. We did not take into account important issues
such as congestion and the need for an accurate wire delay
model, even though we are aware of the inaccuracies of this
approach. Blockages are also not considered. This assumption
is justified by the already claimed need to exploit the porosity
in large IP’s layout for buffer insertion [20] that might also
be used for flip-flop insertion. The underlying assumption for
wiring blockages is that two layers of metal are reserved for
global routing that is allowed to run over the cells. However,
for the purpose of verifying our approach, the router described
in the following is sufficient.
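The floorplan-level estimate described at the start of this section can be sketched as follows. The sketch assumes that the "simple function" is an integer division of the Manhattan distance by the critical length, as suggested by the integer division step mentioned in Section IV; the function names and the sample coordinates are hypothetical.

```python
def manhattan(p, q):
    """Rectilinear (Manhattan) distance between two pins given as (x, y)."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def estimate_flip_flops(source, sinks, critical_length):
    """Decompose a P2MP net into P2P connections and estimate, for each
    sink, the number of pipeline flip-flops on the source-to-sink path.

    Assumed model: integer division of the Manhattan distance by the
    critical length.
    """
    return [int(manhattan(source, sink) // critical_length) for sink in sinks]


# One source driving three sinks, with a critical length of 5 length units.
ffs = estimate_flip_flops((0, 0), [(3, 4), (1, 1), (11, 0)], critical_length=5)
```

Each sink is treated independently here, which is exactly why the estimate can differ from the count obtained once the sinks of a net share routed segments.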
The router inputs are the positions of the nodes to be connected,
internally represented as a graph G{V, E}. The minimum
spanning tree (MST) is first built using the rectilinear distance.
Then, it is used by an algorithm that approximates the minimum
rectilinear Steiner tree (MRST) using a heuristic that adds nodes
to the graph, aiming at reducing wirelength by sharing common
segments (and buffers and flip-flops if needed) among the sinks
of the net. The field of efficient Steiner tree building is still being
explored (see [21]). We alert the reader that the following algorithm
is far from optimal in terms of both its complexity and the quality
of the resulting heuristic MRST, whose optimization was outside the
scope of our work. It serves only as a preliminary step for the
flip-flop planning algorithm.
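The first step, building the MST under the rectilinear distance, can be sketched with Prim's algorithm on the complete graph of the pins. This is a minimal illustration, not the router's actual code; the function name and the list-of-tuples input are assumptions.

```python
def rectilinear_mst(points):
    """Prim's algorithm on the complete graph over `points`, with Manhattan
    edge weights. Returns the tree as a list of (parent_index, child_index)."""
    n = len(points)
    if n == 0:
        return []

    def dist(i, j):
        return abs(points[i][0] - points[j][0]) + abs(points[i][1] - points[j][1])

    in_tree = [False] * n
    best = [float("inf")] * n     # cheapest known connection cost to the tree
    parent = [None] * n
    best[0] = 0                   # start growing the tree from the first pin
    edges = []
    for _ in range(n):
        # Pick the cheapest node not yet in the tree.
        i = min((k for k in range(n) if not in_tree[k]), key=lambda k: best[k])
        in_tree[i] = True
        if parent[i] is not None:
            edges.append((parent[i], i))
        # Relax the connection cost of the remaining nodes.
        for j in range(n):
            if not in_tree[j] and dist(i, j) < best[j]:
                best[j] = dist(i, j)
                parent[j] = i
    return edges


pins = [(0, 0), (4, 0), (4, 3), (0, 5)]
tree = rectilinear_mst(pins)
total_length = sum(
    abs(pins[a][0] - pins[b][0]) + abs(pins[a][1] - pins[b][1]) for a, b in tree
)
```

The quadratic-time version is adequate for nets with a handful of pins; spanning-graph methods such as [21] address larger instances.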
The algorithm works as follows. At each step of the algo-
rithm, a node that has not yet been connected is treated.
All the unconnected nodes surrounding are first orga-
nized in four sets of sinks based
on their position (north, east, south, or west) with respect to
node . A single node may belong to more than one set
(e.g., both north and west, south and east). The direction for the
next routing is . For each set, the min-
imum segments that may be shared among the corresponding
nodes are computed . For instance, if
and are the positions of nodes and
respectively, and (i.e., , ),
then , . The direction is
then chosen according to
which maximizes segment
sharing. Another node is then added. If the direction was
north, then . Node inherits from
all nodes belonging to the set
corresponding to the chosen direction, while loses them.
Then the algorithm proceeds in a breadth-first fashion visiting
nodes inherited by . After the exploration of the subtree of
, the algorithm goes back to , whose remaining directions
can be explored by reapplying the heuristic on the nodes that
still have to be visited. With this approach, the added nodes
stay on the Hanan grid, built according to the positions of
the original nodes. The algorithm is graphically exemplified in
Fig. 6.
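Two ingredients of the heuristic lend themselves to a short sketch: the grouping of unconnected nodes into north, east, south, and west sets (a node may fall into two sets), and the fact that added nodes stay on the Hanan grid. The sketch below assumes that, for the north direction, the shareable segment is the smallest vertical offset among the northern nodes; that choice, like all names here, is an assumption made for illustration.

```python
def direction_sets(u, nodes):
    """Group `nodes` by their position relative to node `u`.
    A node that is, e.g., north-west of u appears in both sets."""
    sets = {"north": [], "east": [], "south": [], "west": []}
    for p in nodes:
        if p[1] > u[1]:
            sets["north"].append(p)
        if p[1] < u[1]:
            sets["south"].append(p)
        if p[0] > u[0]:
            sets["east"].append(p)
        if p[0] < u[0]:
            sets["west"].append(p)
    return sets


def hanan_grid(nodes):
    """All intersections of the horizontal and vertical lines through `nodes`."""
    xs = sorted({x for x, _ in nodes})
    ys = sorted({y for _, y in nodes})
    return {(x, y) for x in xs for y in ys}


u = (2, 2)
sinks = [(0, 5), (4, 6), (5, 1)]
sets = direction_sets(u, sinks)

# Assumed shareable segment when routing north: the smallest vertical offset.
north_step = min(y - u[1] for _, y in sets["north"])
new_node = (u[0], u[1] + north_step)
on_grid = new_node in hanan_grid([u] + sinks)
```

The added node shares the y coordinate of an existing node and the x coordinate of u, which is why it lands on a Hanan grid intersection.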
After routing, a delay model is applied to the tree in order to
evaluate if the timing constraints set at its leaves are respected.
We suppose that optimal buffering is performed and then we
compute the timing slacks. If some of the constraints are not
met, the tree has to be legalized by adding some flip-flops.
Every branch, i.e., every segment connecting two nodes, is
then marked “legal” or “illegal.” The legality is determined by
computing the required time at each node starting from the
leaves of the routing tree where a required time is set. In [17],
we used the following approach for evaluating the legality. If u
is a nonbranching node with only one child v, and if the directed
edge (u, v) has delay d(u, v), the required time r(u) is
given by r(u) = r(v) − d(u, v). If r(u) < 0, the edge (u, v) is
marked “illegal;” otherwise, it is legal.
Fig. 6. Example of heuristic MRST construction.
If u is a branching node
with k children, all the corresponding “candidate” required
times r_i(u), i = 1, ..., k, are computed and used as necessary to
mark as illegal the related edges. Since one node can only have
one required time, if at least one of the candidates is positive,
r(u) is set at the minimum of such positive values. This strategy
leaves unaffected the already “legal” paths passing through the
node. This function is basically a depth-first search, and its
complexity amounts to . An illegal branch followed
by a legal branch is then chosen as a feasible region for flip-flop
placement.
The described approach has a drawback. We stated be-
fore that we annotate a branching node with the worst
(i.e., minimum) required time among the different required
times coming from its edges: ,
children . Suppose that has two children, and , and
that is worse than such that .
Suppose that the upstream branch from node that ends in
is marked illegal because . This
means that the arrival time . Then a flip-flop will
be placed in to legalize it. The latency at nodes and
increases by 1, which is certainly correct for . However, it
might be the case that the path to would have been already
legal without the flip-flop addition, i.e., . These
circumstances are exemplified in Fig. 7(a) and (b), where a
timing constraint violation at node traces
back to the root whose . As a consequence,
a flip-flop is added at , therefore increasing the latency
at both nodes and . However, was already legal being
. This leads to a conservative solu-
tion that sometimes might worsen the system throughput if the
added flip-flop is within a worst loop. Nonetheless, this can be
easily avoided with an improvement we implemented in a new
algorithm with respect to [17] described by the pseudocode
evalBranchLegality.
Fig. 7. Illegal edge (z, u) is detected according to the old algorithm [17] because of the timing constraint violation a(v) > r(v) that traces back to the root:
r(z) < 0 (a). Consequently a flip-flop x is placed along (z, v) (b). According to the new algorithm (z, u) is no longer illegal because r (z) > 0 (c). Thus only
(u, v) is marked illegal and x is placed along (u, v) (d).
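Function evalBranchLegality
Input: G{V, E}, node u
evalBranchLegality(G{V, E}, u);
foreach v ∈ Children(u) do
    evalBranchLegality(G{V, E}, v);
    d = d(u, v); //delay from u to v;
    r_min,v(u) = r_min(v) − d;
    r_max,v(u) = r_max(v) − d;
    if r_min,v(u) < 0 and r_max,v(u) < 0 then
        mark (u, v) illegal;
    else
        mark (u, v) legal;
    end
end
r_min(u) = min over v of r_min,v(u);
r_max(u) = max over v of r_max,v(u);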
Instead of just recording the worst required time, both the worst
r_min(u) and the best r_max(u) required times may be annotated at each
node u of the tree. Therefore an edge (u, v) is judged illegal only
if both values r_min(u) and r_max(u) are negative. The latter condition
states that no path exists that ends in a leaf-node l
and passes through node u such that a(l) ≤ r(l). We formalize
it in the following theorem.
Theorem: Given a tree , a node , and a timing con-
straint at the leaves, if and , then
leaves and .
Proof: It can be demonstrated ab absurdo. Suppose that
exists from to leaf-node such that
and . Among the various leaves of that have node
as predecessor there will certainly be leaves and with
corresponding minimum and maximum arrival times and
(possibly coincident). The length of the path from to
is while from to is
. When we trace back from the
leaves toward the root we compute minimum and maximum re-
quired times at each node by subtracting the length of the path
from that node to its children. As a consequence, when we reach
the node , the minimum required time equals the re-
quired timing constraint at node minus the length of the
path to in
(1)
Similarly, the maximum required time is
(2)
Now, leaf ’s arrival time
, where we used (2). If ab absurdo
it follows that, by rearranging the inequality,
, which is contrary to the hypothesis.
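The argument can be restated compactly. The symbols below are assumptions chosen for this restatement: d(u, l) is the delay of the path from node u to leaf l, a(·) is the arrival time, r(l) is the timing constraint at leaf l, and L(u) is the set of leaves that have u as a predecessor.

```latex
\begin{align*}
r_{\max}(u) &= \max_{l \in L(u)} \bigl[\, r(l) - d(u,l) \,\bigr], \qquad
r_{\min}(u) = \min_{l \in L(u)} \bigl[\, r(l) - d(u,l) \,\bigr], \\
a(l) &= a(u) + d(u,l) \;\ge\; a(u) + r(l) - r_{\max}(u)
\qquad \forall\, l \in L(u).
\end{align*}
% If r_max(u) < 0 and a(u) >= 0, the right-hand side exceeds r(l),
% so a(l) > r(l) for every leaf below u.
% Conversely, a leaf with a(l) <= r(l) would force r_max(u) >= a(u) >= 0.
```

Under these definitions a negative maximum required time at u implies that every leaf below u violates its constraint, which is the content of the theorem.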
When we trace back to the root node , if at least one of
and is positive, then there exists at least one
legal path. That is, when node is the root, the inverse of
the previous theorem holds. This is detailed in the following
corollary.
Corollary: Given a tree , and a timing constraint at the
leaves of , if and only if and , then
.
Proof: The “if” part is the same as that of the previous proof
with . As for the “only if” part suppose that one be-
tween and is . Let us take . Ac-
cording to (2) there is at least one minimum length path with
the arrival time such that
, because the arrival time
(beginning of the tree). From it follows
that . Therefore, there is at least one path which is
totally legal.
To summarize the consequences of the theorem and its corol-
lary, when both and are 0, none of the paths
are already legal and we can proceed with the legalization algo-
rithm described in the following. If one of them is , at least
one legal path exists that does not need flip-flops for legaliza-
tion. In the latter case, to complete the legality check, the tree has
to be visited once again in search of those nodes whose arrival
time does not respect the maximum required time. Therefore, if
there still remains some edge with , then
is marked illegal. This is exemplified in Fig. 7(c), where
we compare the new legality check method with the previous
method in [17] depicted in Fig. 7(a). Min and Max required
times at node are ( 1,3). With the method in Fig. 7(a), the
edge is marked illegal while in Fig. 7(c) it is not, because
path to leaf-node is already legal. Therefore the tree is visited
and the edge is marked illegal because
. Consequently, the flip-flop is added in in Fig. 7(d),
while in Fig. 7(b) it was placed in . The latency at node
is not increased. As we will see in Section VI, this leads to an
improvement in throughput with respect to the results shown in
[17].
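The two-pass legality check can be sketched as follows: a backward pass annotates minimum and maximum required times and marks an edge illegal only when both candidates are negative, and a forward pass then marks the edges whose arrival time exceeds the maximum required time. The tree encoding, function names, and delay values are hypothetical; the numbers were chosen so that the root obtains the (−1, 3) pair discussed for Fig. 7.

```python
def annotate_required_times(tree, required, u, r_min, r_max, illegal):
    """Backward pass: compute r_min/r_max at every node below u."""
    children = tree.get(u, [])
    if not children:                      # leaf: the timing constraint is given
        r_min[u] = r_max[u] = required[u]
        return
    cand_min, cand_max = [], []
    for v, delay in children:
        annotate_required_times(tree, required, v, r_min, r_max, illegal)
        lo, hi = r_min[v] - delay, r_max[v] - delay
        if lo < 0 and hi < 0:             # no legal path through (u, v)
            illegal.add((u, v))
        cand_min.append(lo)
        cand_max.append(hi)
    r_min[u], r_max[u] = min(cand_min), max(cand_max)


def check_arrival_times(tree, u, arrival, r_max, illegal):
    """Forward pass: mark edges whose end node is reached too late.
    This sketch keeps descending after a violation instead of restarting
    the arrival time at an inserted flip-flop."""
    for v, delay in tree.get(u, []):
        a_v = arrival + delay
        if a_v > r_max[v]:
            illegal.add((u, v))
        check_arrival_times(tree, v, a_v, r_max, illegal)


# Root z drives u; u branches to leaves v (slow) and w (fast).
tree = {"z": [("u", 1)], "u": [("v", 5), ("w", 1)]}
required = {"v": 5, "w": 5}
r_min, r_max, illegal = {}, {}, set()
annotate_required_times(tree, required, "z", r_min, r_max, illegal)
check_arrival_times(tree, "z", 0, r_max, illegal)
```

With a single worst-case required time the root value would be −1 and the edge (z, u) would be flagged; keeping both values confines the violation to (u, v), so the latency toward w is left untouched.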
After the legality check, the legalization consists of the following
steps, repeated from the leaves toward the root until the
whole tree is legal: 1) place a flip-flop in a suitable location within
an illegal branch and 2) re-evaluate the legality from the source
to the remaining sinks and to the added flip-flop.
Once the feasible region is found, the correct position has to
be determined. Let (u, v) be the suitable branch. The flip-flop
is placed by evaluating the point n in (u, v) whose arrival time
equals the latch required time, i.e., a(n) − r(n) = 0. The new point
is added to the
MRST. For our experiments we used a geometric delay model
for based on a linear function (then invertible) of the
simple wirelength . Although this might be inaccurate,
we judge it suitable for the purpose of this work [7], [19].
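Because the delay model is linear in the wirelength, it can be inverted in closed form to locate the flip-flop. The sketch below assumes a delay of `delay_per_unit` per unit of wire and a straight (horizontal or vertical) branch; the parameter names and numeric values are hypothetical.

```python
def flip_flop_position(u, v, arrival_u, required_ff, delay_per_unit):
    """Return the point on the straight branch (u, v) where the arrival time
    equals the flip-flop required time, or None if no such point lies on
    the branch.

    Assumed linear model: arrival(x) = arrival_u + delay_per_unit * x,
    where x is the wirelength measured from u.
    """
    length = abs(v[0] - u[0]) + abs(v[1] - u[1])
    x = (required_ff - arrival_u) / delay_per_unit   # invert the linear model
    if not 0 <= x <= length:
        return None
    # Unit step along the branch direction (only one component is nonzero
    # for an axis-aligned branch).
    step_x = (v[0] > u[0]) - (v[0] < u[0])
    step_y = (v[1] > u[1]) - (v[1] < u[1])
    return (u[0] + step_x * x, u[1] + step_y * x)


position = flip_flop_position(
    u=(0, 0), v=(10, 0), arrival_u=2.0, required_ff=5.0, delay_per_unit=0.5
)
```

A nonlinear delay model would require a numerical root search on the same branch instead of this closed-form inversion.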
Once the flip-flop is inserted, delay and legality from the root
to the leaves or to the added flip-flops are recomputed, again as-
suming that the best buffer assignment is performed. The sub-
trees following the inserted flip-flops are already legal by con-
struction. The algorithm for legalization is described by function
legalize. The complexity of the function, which is basically
a depth-first search on the net graph, is in the best
case, if everything is already legal, and in the
worst case, if all branches are illegal.
Function legalize(G{V, E})
Input: MRST = G{V_in, E_in}, node u
Output: MRST = G{V_out, E_out}
legalize(G{V, E}, u);
V = V_in, E = E_in;
foreach v ∈ Children(u) do
    legalize(G{V, E}, v);
    if (u, v) is illegal then
        find location n in (u, v) such that a(n) − r(n) = 0;
        V = V ∪ {n}, E = E ∪ (u, n) ∪ (n, v), E = E \ (u, v);
        reapply the delay model to the modified tree;
        evalBranchLegalityNew(G{V, E}, root);
        legalize(G{V, E}, root);
    end
end
return G{V, E}
VI. EXPERIMENTAL RESULTS
A. Experimental Setting
We integrated the algorithms described in the previous
sections into an existing publicly available simulated an-
nealing floorplanner based on the sequence pair representation
PARQUET (see [30]). On the floorplanner side, our modifi-
cations consisted of empowering the existing framework by
making it possible to deal with pin directions (which is fun-
damental for our method, but immaterial for normal floorplan
problems); compute short loops, exact and heuristic throughput;
and add one optimization mode which, in analogy with the
corresponding wirelength optimization, uses a weighted sum
of area and throughput cost. Even if this tool is far from a fully
engineered implementation of the method, we believe that the
results reported below help in understanding the original features
of the problem we introduced.
B. Benchmarking
A somewhat sensitive issue is that of which benchmarks to
choose in order to validate our approach. Classical [Microelec-
tronics Center of North Carolina (MCNC)] and even more re-
cent benchmark suites (such as the Gigascale Silicon Research
Center (GSRC) series [29]) lack a fundamental feature for our
purposes: clear direction information. Even when this informa-
tion is actually present (as in the .nets format of GSRC series)
all pins are marked as “bidirectional,” which is practically use-
less for our purposes. To be more precise, even if a net is ac-
tually a bidirectional bus, it is never the case that the channel
of communication between the two or more blocks it connects
is accessible during the same clock period; considering it to be
so, besides substantially restricting the achievable optimization
(all bidirectional connections, treated this way, amount to the
shortest possible loop of two blocks, thus automatically fixing
at 0.5 the maximum achievable throughput when 1 delay is
present), misrepresents the functional situation. For this reason,
TABLE I
FLOORPLANNING RESULTS
we decided to assign directions to all pins. We first adopted the
strategy suggested in [28] of making the last pin of a net the
net source. This strategy worked perfectly for the four GSRC
benchmarks we analyzed (n10, n30, n50, and n100), but it was
not viable for the classical MCNC benchmarks (ami33, ami49,
apte, hp, and xerox), because this would have led to loopless networks.
For this reason, we randomly assigned one pin per net as
its source. This solution, of course, could generate unnaturally
easy or hard problems, and to test it, we introduced one more
benchmark that was built ad hoc from the IBM placement suite
where, on the contrary, pin directions of gates are indicated.
However, the mixed standard-cell block structure of these benchmarks
made it impossible to employ them directly in a methodology
such as the one herein proposed, which presupposes large blocks
with fairly constrained mutual communication. Therefore, we
took one of the simplest benchmarks of the suite (ibm02), par-
titioned it with hMetis [31] into a netlist of 30 large blocks
with their own input and output terminals, and randomly picked
200 nets, preserving their direction information. With the ob-
tained block level netlist, we proceeded as for the other bench-
marks. We also experimented with the use of synthetic bench-
marks like GNL [32], proceeding in a similar way, partitioning
flat gate level netlists into larger blocks. The results obtained
(i.e., achieved throughput, area, wirelength) were comparable
with all other results obtained with the classic floorplanning
benchmarks.
C. Results for Floorplanning
In this section, we present our results in terms of throughput-
driven floorplanning. The main aim of our work was not to
compare with other floorplanning strategies, but rather to give
some indications of how the throughput problem could be
partially solved with the use of a cost function like the one
described in Section IV. For this reason, we launched our
floorplanner with four different cost functions for each case
under consideration: Area, Area + Half Perimeter Wire Length,
Area + Heuristic Throughput cost, and Area + Wire Length +
Heuristic Throughput. For the area and wirelength, and area
and throughput optimizations, we used a balanced weighting
of the two costs (half each), while in the three-objective case
we used proportions of 0.4, 0.3, and 0.3, respectively. We ran
each case ten times, with a time limit of 60 s on a Linux ma-
chine with a Pentium III processor at 1.4 GHz, with 700 MB
of RAM and 512 kB of cache. The usage of a time limit means
that the number of iterations varies for the three optimiza-
tions. However, the wirelength and throughput optimizations
have a similar execution time per iteration. Incidentally, this
confirms the efficiency of the cost function and its easy
integration into existing frameworks. For each case, we gathered
average and minimum results for three quantities: whitespace,
wirelength, and degradation of throughput (computed with the exact
algorithm). A result of 0.391/0.25, for example, means that the
benchmark has an average throughput of 1 − 0.391 and a maximum
throughput of 1 − 0.25. Note that we express
throughput in this way throughout the paper and, therefore,
a smaller value implies better performance. Also, for each
benchmark, we considered four different critical lengths: 0.3,
0.5, 0.7, and 1.0 times the minimum die edge computed as
the square root of the total block area. Results are reported
in Table I. Line ami33.50, for example, reports the results for
Fig. 8. Average area and wirelength for the various cost functions.
Fig. 9. Data rates for the various cost functions.
benchmark ami33 where critical length is 50% of the min-
imum floorplan edge. As the dimensions of the blocks are of
the essence for this method (i.e., a floorplanning problem with
critical length of 30% of the die is bound to be different from
an analogous problem with a 100% critical length), the units
of the length and areas are different for each example, but are
not reported for simplicity.
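The experimental setup above can be summarized in a few lines of code: the critical lengths derived from the total block area, the weighted cost with the stated proportions, and the conversion from the reported degradation to throughput. The function names are hypothetical, and the sketch assumes that the cost terms have already been normalized to comparable ranges, which the text does not detail.

```python
import math


def critical_lengths(block_areas, fractions=(0.3, 0.5, 0.7, 1.0)):
    """Critical lengths as fractions of the minimum die edge, taken as the
    square root of the total block area."""
    min_die_edge = math.sqrt(sum(block_areas))
    return [f * min_die_edge for f in fractions]


def weighted_cost(area, wirelength, throughput_cost, weights):
    """Weighted sum of (assumed pre-normalized) cost terms."""
    w_area, w_wl, w_thr = weights
    return w_area * area + w_wl * wirelength + w_thr * throughput_cost


def throughput_from_degradation(degradation):
    """Tables report 1 - throughput, so smaller values are better."""
    return 1.0 - degradation


lengths = critical_lengths([4.0, 5.0, 7.0])                   # total area 16
two_objective = weighted_cost(1.0, 0.0, 0.5, (0.5, 0.0, 0.5))  # area + throughput
three_objective = weighted_cost(1.0, 2.0, 0.5, (0.4, 0.3, 0.3))
average_throughput = throughput_from_degradation(0.391)
```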
Fig. 8 shows average area and wirelength under different
optimization objectives [area, wirelength (WL), throughput
(THR), combined throughput and wirelength (WLTHR)]. The
left axis is in percent of the total area of the blocks:
an area of 106 represents an average of 6% whitespace. The
right axis is in thousands of length units. Fig. 9 shows how data
rates (frequency by throughput) are influenced by the various
cost functions, depending on the critical length. Arbitrary units
are used to fit the figure size.
It is apparent from Fig. 8 that, provided some cost related
to interconnect length and/or throughput is taken into consider-
ation, area penalties are similar. On the other hand, throughput
comes at the cost of wirelength, for which it can be conveniently
traded, as shown by the WLTHR results. Fig. 9 shows an inter-
esting increasing trend for data rates with deeper pipelining (no
matter what optimization is used), which will be investigated in
Section VI-D.
The results allow us to draw a series of conclusions.
1) THR optimization does achieve better throughput results.
Average gains with respect to area and wirelength
minimization are 25% and 11%, respectively. If we
consider only the gain in the case of the longest critical
length, the two gains are 64% and 24%, respectively,
thus suggesting the existence of a threshold critical
length over which the gain becomes substantial.
2) Wirelength and throughput minimization are goals that
are not as closely correlated as might be thought.
3) Different benchmarks behave differently as far as
throughput is concerned, and their difficulty is not a
simple function of other features, let alone of their
complexity (number of blocks, number of nets).
4) The task is in and by itself inherently difficult. There
is no chance of getting an acceptable throughput with
area optimization, while wirelength minimization also
can lead to highly suboptimal results (see, for example,
the case of n100).
5) The last three columns (three-objective optimization)
show that sometimes wirelength can be recovered by
trading in a small amount of throughput, even though
the effect of the combined optimization is bench-
mark-dependent. In general, we experimented with
different weights to the various optimization goals,
finding results within the bounds of the corner cases
of only wirelength and only throughput minimization.
This can help in managing a tradeoff in highly con-
gested designs.
D. Frequency Versus Throughput Considerations
An abstract consideration might be used against any
throughput optimizing strategy that aims at increasing fre-
quency through a latency-independent method. It follows these
lines: Given a working implementation for which each connection
is just shorter than the critical length, at a frequency f,
another implementation at a double frequency 2f is possible
that introduces one pipelining element on each edge. Then, if
the system is not purely feedforward, the theory admits a degradation
of throughput of 1/2, thus giving a data rate of f,
neither better nor worse than the original case, with the added
disadvantage of requiring a more critical clock distribution
network, with all the related issues. So, it is important to see if
the doubling of the frequency always calls for a halving of the
throughput, or if the optimization process might actually end
up with substantially better results.
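The objection can be checked numerically. The sketch below assumes the loop throughput used in latency-insensitive design, m/(m + n) for a loop of m blocks with n inserted pipeline elements, with the system limited by its worst loop; that formula is defined outside this section and is taken here as an assumption, as are the function names and the sample loops.

```python
from fractions import Fraction


def loop_throughput(blocks, flip_flops):
    """Assumed throughput of a loop with `blocks` blocks and `flip_flops`
    pipeline elements inserted on its edges."""
    return Fraction(blocks, blocks + flip_flops)


def data_rate(frequency, loops):
    """Data rate = frequency x throughput of the worst loop.
    `loops` is a list of (blocks, flip_flops) pairs."""
    throughput = min(
        (loop_throughput(m, n) for m, n in loops), default=Fraction(1)
    )
    return frequency * throughput


baseline = data_rate(1, [(3, 0), (4, 0)])      # frequency f, no pipelining
naive = data_rate(2, [(3, 3), (4, 4)])         # 2f, one flip-flop on every edge
optimized = data_rate(2, [(3, 1), (4, 2)])     # 2f, fewer flip-flops in loops
```

The naive doubling returns exactly the baseline data rate, whereas a floorplan that keeps some loop edges below the critical length ends up above it, which is the question the sweep in this section investigates.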
For these reasons, we took the various benchmarks and made
a sweep of the critical length parameter, then graphing the data
rate. We repeated the experiment with three different cost func-
tions: The heuristic throughput cost described in Section IV,
a quadratic length, and a simplified version of the exact cost,
wherein only loops of maximum length 4 are counted. The
reason for adding the other two cost functions was also to compare
the proposed heuristic with other possible approaches. In the
quadratic case, we used as a cost the sum of the squares of the
Euclidean lengths of the pin-to-pin connections belonging to loops,
the rationale being that a quadratic cost will tend to favor shorter
Fig. 10. Overall performance of n100.
connections with respect to a Manhattan distance. The truncated
exact cost, on the other hand, tames the complexity of the exact
computation by limiting the depth of the search to loops of length
4. The results, for benchmark n100, are shown in Fig. 10, where
the x axis represents the critical length (as a percentage of the
minimum die edge). The y values represent the average over five
random seed placements using the different values for the critical
length. The topmost line is the curve obtained using the heuristic
cost, which behaves much better than the more “accurate” cost
function that characterizes the “reduced exact” curve. Finally,
even if the quadratic cost behaves better than the truncated exact
cost, it is still outperformed by the heuristic, which proved to
be the best in all cases, for all examples we ran.
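For reference, the quadratic alternative can be written in a few lines. The sketch takes the pin-to-pin connections that belong to loops as already identified; the function name and the sample coordinates are hypothetical.

```python
def quadratic_cost(loop_connections):
    """Sum of the squared Euclidean lengths of the pin-to-pin connections
    that belong to loops. Each connection is ((x1, y1), (x2, y2))."""
    return sum(
        (x1 - x2) ** 2 + (y1 - y2) ** 2
        for (x1, y1), (x2, y2) in loop_connections
    )


connections = [((0, 0), (3, 4)), ((1, 1), (2, 1))]
cost = quadratic_cost(connections)   # 3^2 + 4^2 for the first, 1^2 for the second
```

Squaring penalizes one long connection more than several short ones of the same total length, but unlike the heuristic cost it ignores the critical length altogether.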
It appears that, ceteris paribus, reducing the critical length
is always beneficial in increasing the overall performance (data
rate). This apparently counterintuitive conclusion has to be
weighed against the simplifying assumptions herein made:
simple inverse proportionality between critical length and
frequency, abstraction from detailed circuit implementations,
intrinsic logic limitations (it is not possible to achieve arbitrarily
high frequencies due to combinational logic delays), clock tree
synthesis, and skew control. Nonetheless, the graph shows there
is abundant design space to proceed toward faster systems.
E. Results for Tree Flip-Flop Routing
The columns in Table II indexed by Rout compare the results
obtained after floorplanning for one of the ten trials with those
after flip-flop routing. Results are reported for both old [17] and
new algorithms. Degradation with the old routing strategy is
monotonous, with an average loss of 16%. On the other hand,
the new routing algorithm shows a clear improvement, giving a
moderate loss of only 6% on average.
The other columns in Table II report the number of pipelining
registers after routing, and compare it with two measures: A half
perimeter measure of the necessary registers (min HPWL), and
the number of FF required by routing each pin-to-pin connection
separately (P2P). The first value (min HPWL) tries to capture
the result of a good routing, in the sense that, for each multipin
net, the half perimeter length is evaluated, and then divided by
the critical length. The rationale behind this is that an HPWL is
TABLE II
ROUTING RESULTS
known to be always a correct measure of wirelength for connec-
tions with up to three pins, and generally an underestimate for
nets with more pins. We point out that, should the source of the
net be in the middle of the net’s bounding box, a clever routing
could give a smaller number of flip-flops (if some of the P2P dis-
tances are smaller than the critical length, for example, they will
not introduce any pipelining elements). This explains why our
routing strategies, in some cases, can produce a smaller number
of elements than HPWL. On the other hand, the P2P values are
an upper bound on the number of registers. Averages in
the table are normalized with respect to the ideal throughput and
HPWL, respectively.
The number of elements is always close to or better than the
HPWL values, thus indicating a good overall routing quality; at
the same time, especially in the case of short critical length, the
savings with respect to the P2P strategy can be substantial (see,
for example, n100).
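The two reference measures can be sketched as follows, assuming that the register count is the integer division of a length by the critical length; the function names and the sample nets are hypothetical. The second net illustrates the remark about a source located in the middle of the bounding box.

```python
def hpwl(pins):
    """Half-perimeter wirelength of the bounding box of `pins`."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def min_hpwl_registers(pins, critical_length):
    """Registers estimated from the half perimeter (assumed integer division)."""
    return int(hpwl(pins) // critical_length)


def p2p_registers(source, sinks, critical_length):
    """Registers needed when every source-to-sink connection is routed
    separately, using the Manhattan distance of each connection."""
    return sum(
        int((abs(source[0] - s[0]) + abs(source[1] - s[1])) // critical_length)
        for s in sinks
    )


# Net with the source in a corner of its bounding box.
src, snk = (0, 0), [(9, 0), (9, 2), (8, 4)]
corner = (min_hpwl_registers([src] + snk, 4), p2p_registers(src, snk, 4))

# Net with the source in the middle of its bounding box.
src2, snk2 = (5, 5), [(0, 5), (10, 5), (5, 0), (5, 10)]
centered = (min_hpwl_registers([src2] + snk2, 6), p2p_registers(src2, snk2, 6))
```

In the first net the P2P count is well above the HPWL-based one, as expected from an upper bound; in the second, every connection is shorter than the critical length, so no register is needed even though the HPWL-based measure predicts three.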
VII. CONCLUSION AND FUTURE WORK
The problem of the insertion of flip-flops in interconnects for
wire pipelining in deep-submicron large size integrated circuits
has been tackled in this paper from the perspective of the poten-
tial throughput degradation this technique may imply. In order
to limit this detrimental countereffect, we inserted a modified
cost function in a simulated annealing based floorplanner. Such
a new cost metric models the latency of the interconnects as a
simple function of the Manhattan distance between two points
in a layout and then evaluates the throughput reduction. To the
best of our knowledge, this work is the first to systematically
address this problem in general and to propose a viable yet ef-
fective solution.
The experimental results demonstrate the advantages of the
new approach compared to standard cost functions like area
or wirelength. The final throughput achieved by the new floor-
planner has been tested using a flip-flop routing algorithm suit-
ably developed and used as a back-end tool. The results consis-
tently agreed on all benchmarks.
Future work will address many effects not considered here.
On the optimization side, the floorplanner framework will be
better explored to fine tune its performance and allow treatment
of soft blocks. As for the router, congestion and various block-
ages shall be considered for a more realistic evaluation. Another
key point is the adoption of a better delay model. Finally, an effort
shall be made to gather more appropriate benchmark data.
REFERENCES
[1] The International Technology Roadmap for Semiconductors, SIA, 2003.
[2] L. P. Carloni et al., “A methodology for “correct-by-construction” la-
tency insensitive design,” in Proc. ICCAD, 1999, pp. 309–315.
[3] T. Chelcea and S. M. Nowick, “Robust interfaces for mixed-timing sys-
tems with application to latency-insensitive protocols,” in Proc. DAC,
2001, pp. 21–26.
[4] L. P. Carloni and A. L. Sangiovanni-Vincentelli, “Performance analysis
and optimization of latency insensitive protocols,” in Proc. DAC, 2000,
pp. 361–367.
[5] L. P. Carloni et al., “Theory of latency-insensitive design,” IEEE
Trans. Computer-Aided Design Integr. Circuits Syst., vol. 20, no. 9, pp.
1059–1076, Sep. 2001.
[6] M. R. Casu and L. Macchiarulo, “A new approach to latency insensitive
design,” presented at the DAC, San Diego, CA, Jun. 2004, pp. 576–581.
[7] J. Cong and S. K. Lim, “Physical planning with retiming,” in Proc.
ICCAD, 2000, pp. 2–7.
[8] R. Lu and C.-K. Koh, “Interconnect planning with local area constrained
retiming,” in Proc. DATE, 2003, pp. 442–447.
[9] C. Chu et al., “Retiming with interconnect and gate delay,” in Proc.
ICCAD, 2003, pp. 221–226.
[10] D. K. Tong and E. F. Y. Young, “Performance-driven register insertion
in placement,” in Proc. ISPD, 2004, pp. 53–60.
[11] L. P. Carloni and A. L. Sangiovanni-Vincentelli, “Combining retiming
and recycling to optimize the performance of synchronous circuits,” in
Proc. SBCCI, 2003, pp. 47–52.
[12] L. P. P. P. Van Ginneken, “Buffer placement in distributed RC-tree net-
works for minimal Elmore delay,” in Proc. ISCC, 1990, pp. 865–868.
[13] R. Lu et al., “Flip-flop and repeater insertion for early interconnect plan-
ning,” in Proc. DATE, 2002, pp. 690–695.
[14] S. Hassoun et al., “Optimal buffered routing path constructions for
single and multiple clock domain systems,” in Proc. ICCAD, 2002, pp.
247–253.
[15] P. Cocchini, “Concurrent flip-flop and repeater insertion for high perfor-
mance integrated circuits,” in Proc. ICCAD, 2002, pp. 268–273.
[16] P. Cocchini, “A methodology for optimal repeater insertion in pipelined
interconnects,” IEEE Trans. Computer-Aided Design Integr. Circuits Syst.,
vol. 32, no. 12, pp. 1613–1624, Dec. 2003.
[17] M. R. Casu and L. Macchiarulo, “Floorplanning for throughput,” in
Proc. ISPD, 2004, pp. 62–69.
[18] H. D. Lin and D. G. Messerschmitt, “Improving the iteration bound of
finite state machines,” in Proc. ISCAS, vol. 3, 1989, pp. 1923–1928.
[19] J. Cong and D. Z. Pan, “Interconnect delay estimation models for syn-
thesis and design planning,” in Proc. ASP-DAC, 1999, pp. 97–100.
[20] C. J. Alpert et al., “Porosity aware buffered steiner tree construction,” in
Proc. ISPD, 2003, pp. 158–165.
[21] H. Zhou, “Efficient steiner tree construction based on spanning graphs,”
IEEE Trans. Computer-Aided Design Integr. Circuits Syst., vol. 23, no.
5, pp. 704–710, May 2004.
[22] F. R. Boyer et al., “Optimal design of synchronous circuits using soft-
ware pipelining techniques,” ACM TODAES, vol. 6, no. 4, pp. 516–532,
2001.
[23] X. Hong et al., “Corner block list: an effective and efficient topological
representation of nonslicing floorplan,” in Proc. ICCAD, Nov. 5–9, 2000,
pp. 8–12.
[24] P.-N. Guo et al., “An O-tree representation of nonslicing floorplan and
its applications,” in Proc. DAC, Jun. 21–25, 1999, pp. 268–273.
[25] H. Murata et al., “VLSI module placement based on rectangle-packing
by the sequence-pair,” IEEE Trans. Computer-Aided Design Integr. Circuits
Syst., vol. 15, no. 12, pp. 1518–1524, Dec. 1996.
[26] M. Moe and H. Schmit, “Floorplanning of pipelined array modules using
sequence pairs,” in Proc. ISPD, 2003, pp. 143–150.
[27] C. W. Sham and E. F. Y. Young, “Routability driven floorplanner with
buffer block planning,” in Proc. ISPD, 2002, pp. 50–55.
[28] Y. Ma et al., “An integrated floorplanning with an efficient buffer planning
algorithm,” in Proc. ISPD, 2003, pp. 136–142.
[29] (2003) GSRC T2 Bookshelf at UC Santa Cruz. [Online] Available:
www.cse.ucsc.edu/research/surf/GSRC/progress.html
[30] Parquet: Fixed-Outline Floorplanner. [Online] Available:
http://vlsicad.eecs.umich.edu/BK/parquet/
[31] hMETIS: Serial Hypergraph & Circuit Partitioning. [Online] Available:
http://www-users.cs.umn.edu/~karypis/metis/hmetis/
[32] D. Stroobandt et al., “Generating synthetic benchmark circuits for evaluating
CAD tools,” IEEE Trans. Computer-Aided Design Integr. Circuits
Syst., vol. 19, no. 9, pp. 1011–1022, Sep. 2000.
Mario R. Casu (M’05) received the Electronics Engineer
degree (summa cum laude) and the Ph.D. degree
from the Politecnico di Torino, Torino, Italy, in
1998 and 2002, respectively.
He is currently an Assistant Professor in the
Department of Electronics, Politecnico di Torino.
In 2001, he was with ST Microelectronics Central
Research and Development, where he worked on
SRAM development and CMOS library characterization
on a partially depleted 0.13-μm SOI technology.
He has coauthored several papers published in
international conference proceedings and journals. His research interests are
in the field of CMOS circuit modeling and design and interconnect-related
circuit, architecture, and computer-aided design-level issues in advanced deep
submicron technologies.
Luca Macchiarulo received the M.S. and Ph.D. degrees
from the Politecnico di Torino, Torino, Italy, in
1995 and 1999, respectively.
He was a Postdoctoral Researcher at the University
of California, Santa Barbara, and the Politecnico
di Torino. He is currently an Assistant Professor in
the Department of Electrical Engineering, University
of Hawaii, Honolulu. His research interests are in
high-speed interconnect design and analysis, architectural
and layout optimization of throughput, field
programmable gate array and regular logic synthesis
and optimization, and physical design issues of ultra-deep-submicron scaling.
