Fine Grain Incremental Rescheduling Via Architectural Retiming
Soha Hassoun
Tufts University, Medford, MA 02155
soha@eecs.tufts.edu
Abstract
With the decreasing feature sizes during VLSI fabrication
and the dominance of interconnect delay over that of gates,
control logic and wiring no longer have a negligible impact
on delay and area. The need thus arises for developing
techniques and tools to redesign incrementally to eliminate
performance bottlenecks. Such a redesign effort corresponds to
incrementally modifying an existing schedule obtained via
high-level synthesis. In this paper we demonstrate that
applying architectural retiming, a technique for pipelining
latency-constrained circuits, results in incrementally
modifying an existing schedule. Architectural retiming
reschedules fine grain operations (ones that have a delay
equal to or less than one clock cycle) to occur in earlier
time steps, while modifying the design to preserve its
correctness.
1 Introduction
High-Level Synthesis (HLS), which translates a behavioral
description to a register transfer level (RTL) description,
comprises scheduling and binding. Scheduling determines
the start time of operations corresponding to the functions
in the behavioral description. Latency and/or resource con-
straints shape the resulting schedule. Binding assigns opera-
tions to resources. Scheduling and binding, thus, synthesize
an architecture with defined temporal and spatial
properties. The resulting design is a register-transfer, abstract
structural implementation.
With the decreasing feature sizes during VLSI fabrication
and the dominance of interconnect delay over that of gates,
control logic and wiring no longer have a negligible impact
on delay and area. The need thus arises for (a) better de-
lay and area models during HLS, and (b) the mechanism
for relaying, perhaps dynamically, that information back to
the scheduler. Alternatively, a schedule can be incrementally
modified once detailed timing and area information is
available, thus modifying the existing schedule only when
and where necessary.
Incremental rescheduling is the process of modifying an ex-
isting schedule if the initial schedule does not meet its stated
initial goals. Incremental rescheduling is appropriate once
physical pipelining registers, steering logic, wiring delays,
and control logic are appropriately estimated. Only minor
changes are made because the goal is to minimally impact
the existing schedule while respecting the precedence and la-
tency constraints used during initial scheduling. Incremen-
tal rescheduling can be considered as a step in percolation
scheduling [2], which refines an initial schedule in a stepwise
fashion as allowed by resource and timing constraints.
However, incremental scheduling is applied much later in
the design cycle to a structural description of the circuit.
The term fine grain operation, or operation, in this paper
refers to one that executes in one clock cycle or less. In
fine grain rescheduling, part of a one-cycle operation can be
rescheduled, i.e., moved to earlier or later time steps.
Fine grain incremental rescheduling is equivalent to resyn-
thesizing the circuit portion limiting the performance. Ar-
chitectural retiming [4] is such an optimization technique.
Given an RTL design, architectural retiming removes the
performance bottleneck(s) by rescheduling the operations
along the critical paths between primary inputs and outputs,
or the critical cycles (paths posing the lowest iteration
bound). We refer to both types of bottlenecks as
latency-constrained paths because no pipelining is allowed to further
reduce the lowest iteration bound.
We investigate in this paper the relationship between
architectural retiming and fine grain incremental rescheduling.
Architectural retiming reschedules operations either
with unlimited resource constraints or with limited additional
resources. Without any resource constraints,
architectural retiming reschedules an operation earlier in time
(the previous pipeline stage), and synthesizes circuitry to
perform the rescheduled operation quicker than in the original
schedule. It thus changes the dependencies of an operation
to an earlier time step than that specified in the original
schedule. This results in performing a precomputation
while allocating the needed additional resources. Without
additional resources, architectural retiming pipelines a signal
and speculatively schedules all its dependent operations.
The only additional resources needed are the ones used to
verify the correctness of the speculative operations and to
recover after erroneous computations. This form of
architectural retiming is referred to as prediction. Our goal in
this paper is to show that architectural retiming is a form of
fine grain rescheduling capable of incrementally modifying
an existing schedule.
We begin this paper by reviewing our model and architec-
tural retiming. Next, we discuss precomputation and predic-
tion. For each, we describe the resulting schedule changes,
we present an example, and we compare our technique with
other high-level synthesis and scheduling approaches. We
conclude by summarizing the contributions of this paper.
2 Model
To describe our algorithms, we use a simple form of
control/data flow graph (CDFG) [2]. Our structural representation
is described based on the following assumptions. All
registers in a circuit are edge-triggered registers clocked by
the same clock. Time is measured in terms of clock cycles
and the notation $x^t$ denotes the value of a signal x during
clock cycle t, where t is an integer clock cycle. Values of
signals are referenced after all signals have stabilized and it
is assumed that the clock period is sufficiently long for this
to happen. A register delays its input signal y by one cycle.
Thus, $z^{t+1} = y^t$, where z is the register's output.
Each fine grain operation in the design is described using a
single-output function, f, of N input variables (or signals)
$x_0, x_1, \ldots, x_{N-1}$ computing a variable y. In a specific
cycle t, the variable y is assigned the value computed by the
function f using specific values for $x_0, x_1, \ldots, x_{N-1}$, that is,
$y^t = f(x_0^t, x_1^t, \ldots, x_{N-1}^t)$. Each function may be composed
of Boolean and mathematical operations.
To describe a computation over time, we use a table (or
a schedule). For clarity, arrows showing dependencies are
annotated with function names to denote the dependencies
between signals. The schedule for the circuit path in Fig-
ure 1(a) is shown in Figure 2(a).
3 Architectural Retiming: A Review
Architectural retiming pipelines a latency-constrained path
while preserving a circuit's latency and functionality. La-
tency constraints arise frequently in practice, and they are
due to either cyclic dependencies or explicit performance
constraints. Architectural retiming is comprised of two steps.
First, a register is added to the latency-constrained path.
Second, the circuit is changed to absorb the increased la-
tency caused by the additional register. To preserve the
circuit's latency and functionality, we use the negative reg-
ister concept. A normal register performs a shift forward in
time, while a negative register performs a shift backward in
time. That is, the output of the negative register is com-
puted one cycle before the actual arrival of its input signal.
If z is the negative register's output, and y is the input,
then $z^t = y^{t+1}$. A negative register cancels the effect of the
added pipeline register; the register pair reduces to a wire.
Resulting performance improvements are due to increasing
the number of pipelining stages, and thus clock cycles
available for the computation. This allows for a shorter clock
period and improved performance.
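The cancellation of a negative register by a pipeline register can be illustrated with a small trace-level sketch. This is only an idealized model under our own assumptions: signals are Python lists indexed by clock cycle, unknown values are marked with `None`, and the function names `register` and `negative_register` are hypothetical helpers, not part of any synthesis tool.

```python
def register(signal):
    """Normal register: z^{t+1} = y^t (the value in cycle 0 is unknown)."""
    return [None] + signal[:-1]


def negative_register(signal):
    """Idealized negative register: z^t = y^{t+1} (the last value is unknown)."""
    return signal[1:] + [None]


# Hypothetical trace of signal y over six clock cycles.
y = [3, 1, 4, 1, 5, 9]

# Negative register followed by the added pipeline register.
u = register(negative_register(y))

# From cycle 1 onward the register pair behaves like a wire.
print(u[1:] == y[1:])
```

The negative register is not realizable as written, since it reads a future value; the sketch only shows why the pair leaves the circuit's latency unchanged.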
Two implementations of the negative register are possible:
precomputation and prediction. In precomputation, the
negative register is synthesized as a function that precom-
putes the input to the added pipeline register using signals
from the previous pipeline stage. In prediction, the nega-
tive register's output is predicted one clock cycle before the
arrival of its input. The predicting negative register is
synthesized as a finite state machine capable of predicting new
values, determining mispredictions, and correcting
mispredictions. More details about architectural retiming and its
use in both logic and structural architectural synthesis can
be found in [3].
The next two sections investigate how precomputation and
prediction relate to fine grain incremental rescheduling.
4 Incremental Rescheduling Without Resource Constraints
4.1 Precomputation
When implementing the negative register added by architectural
retiming as a precomputation, we effectively synthesize
a function f' that precomputes the input to the
added pipeline register based on the inputs of the previous
pipeline stage. Precomputation is illustrated along a path,
p, in Figure 1. Applying the definition of the negative register
to its output z gives:
$z^t = y^{t+1} = f(x^{t+1}) = f(g(m^t)) = f'(m^t)$
The function f' is the composition of functions f and g and
evaluates based on the inputs of the previous pipeline stage.
If f' can be computed faster than the original composition,
then the total delay along the critical path is reduced, and
the new path can be substituted for the critical one (note
that some nodes along the critical path are retained if needed
by other nodes in the circuit). Retiming can then optimally
place the registers in the circuit. Precomputation thus ex-
poses the concurrency in two adjacent pipeline stages. Note
that precomputation is not possible if the inputs to the cur-
rent pipeline stage are unavailable earlier in time (for exam-
ple, they constitute one of the circuit's primary inputs).
We examine the impact of precomputation on the schedule
(before any retiming). Although the pipeline register delays
a signal by one clock cycle, its input is computed one clock
cycle earlier. Thus, the precomputed signal (the input to the
added pipeline register) is rescheduled one clock cycle earlier
in time. Moreover, the new function f' computes this input
based on signals in the same time step which are inputs to
that pipeline stage. The original and modified schedules for
two adjacent time steps t and t+1 for path p are illustrated
in Figure 2.
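The schedule change can be checked on a toy version of path p. The sketch below is illustrative only: the stage functions `g` and `f` are arbitrary choices of ours, `f_prime` is their hand-simplified composition, and traces are lists indexed by clock cycle with `None` for unknown values.

```python
def g(m):
    return m + 1          # hypothetical first-stage function


def f(x):
    return 2 * x          # hypothetical second-stage function


def f_prime(m):
    return 2 * m + 2      # f(g(m)), simplified by hand


def original_schedule(m_trace):
    # Pipeline register between g and f: x^{t+1} = g(m^t).
    x = [None] + [g(m) for m in m_trace[:-1]]
    # y^t = f(x^t), computed in the same cycle as x.
    return [None if v is None else f(v) for v in x]


def precomputed_schedule(m_trace):
    # z^t = f'(m^t): the value of y is now computed one cycle earlier.
    z = [f_prime(m) for m in m_trace]
    # Added pipeline register: u^{t+1} = z^t.
    u = [None] + z[:-1]
    return z, u


m_trace = [0, 1, 2, 3]
y = original_schedule(m_trace)
z, u = precomputed_schedule(m_trace)
print(z[:-1] == y[1:], u == y)
```

In this toy case z carries in cycle t the value that y carries in cycle t+1, and the added pipeline register restores the original timing at u.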
Figure 1: Path p. (a) Original path. (b) Architecturally
retimed path: a negative register is added followed by a
pipeline register. (c) Precomputation function f' implements.
Precomputation-based architectural retiming was shown to
be effective in improving performance [5]. In addition,
precomputation results in synthesizing two interesting
architectural transformations: bypassing and lookahead.
Bypassing (forwarding) is an architectural optimization tech-
nique often used in processor design to reduce the latencies
associated with writing and then reading the same location

















Figure 2: Schedules for path p in Figure 1. The labeled arrows indicate the function that computes the signal at the head of
the arrow. Arrows crossing a time line indicate the presence of a pipeline register. (a) Original schedule. (b) Schedule after
precomputation. (c) Schedule after prediction. The value $y^{i+1}$ at the output of the negative register, z, in clock cycle t is a
prediction. It is propagated to u in cycle t+1, and it is also verified against the true signal $y^{i+1}$. If it is correct, the
negative register predicts again, this time the value $y^{i+2}$. If it is incorrect, signal z is assigned the correct
value in clock cycle t+2. Two clock cycles are thus used to produce a correct value in case of a misprediction.
domains to hide register array latencies. Precomputing
the output of an array of registers (representing a RAM,
ROM, FIFO, or register file) results in synthesizing a bypass
transformation and the necessary control logic.
When precomputing in single-register cycles, architectural
retiming results in pipelining this cycle. This is similar to
synthesizing a lookahead transformation on a recursive data
flow graph used in high-level synthesis. The following
example illustrates this transformation.
4.2 Precomputation Example: GCD
Consider the CDFG for GCD as shown in Figure 3. The
corresponding RTL implementation is shown in Figure 4 (see
footnote 1). The reset circuitry and logic to compute done are not drawn for
clarity. The critical cycle posing the lowest iteration bound
involves computing x > y and setting up the multiplexors.
We apply precomputation-based architectural retiming. We
add a negative register followed by a pipeline register to
pipeline the result of the comparison. The output of the
negative register will be labeled m. The input will be labeled
n. Using the definition of the negative register, we compute
m first in terms of signals from time t+1, and then re-express
these in terms of signals available at time t. Thus,
$m^t = n^{t+1} = (x^{t+1} > y^{t+1})$
We can evaluate x and y in terms of signals available in an
earlier clock cycle:
$n^{t+1} = \mathit{comp}^t \,?\, (x^t - y^t > y^t) : (x^t > y^t - x^t)$
Footnote 1: A faster implementation of GCD is to concurrently compute x - y
and y - x and use the most significant bit of one of the results to select
appropriately the new value. We choose the slower implementation to
demonstrate how both precomputation and prediction can be applied
to a simple example.
Signal $\mathit{comp}^t$ is the result of comparing $x^t$ and $y^t$. A more
efficient implementation of $x^{t+1} > y^{t+1}$ can be obtained by
replacing $((x^t - y^t) > y^t)$ with the most significant bit of
the operation $(2 y^t - x^t)$. Similarly, $(x^t > y^t - x^t)$ can be
replaced by the most significant bit of $(y^t - 2 x^t)$.
Assuming a k bit wide datapath, we can therefore compute
$m^t$ as follows:
$m^t = \mathit{comp}^t \,?\, (2 y^t - x^t)[k-1] : (y^t - 2 x^t)[k-1]$
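The sign-bit formulation can be sanity-checked by brute force. The sketch below is our own check, not the synthesized circuit. It assumes k = 8, two's complement arithmetic, and positive operands below 2^(k-2) so that 2y - x and y - 2x fit in k bits without overflow. The helper names `msb`, `gcd_step` and `precomputed_m` are hypothetical.

```python
K = 8  # assumed datapath width


def msb(value, k=K):
    """Most significant bit of value in k-bit two's complement."""
    return ((value & ((1 << k) - 1)) >> (k - 1)) & 1


def gcd_step(x, y):
    """One iteration of the subtractive GCD datapath."""
    if x > y:
        return x - y, y
    return x, y - x


def precomputed_m(x, y, k=K):
    """m^t computed from cycle-t signals only, using the sign bits."""
    comp = x > y
    if comp:
        return bool(msb(2 * y - x, k))
    return bool(msb(y - 2 * x, k))


def check(limit):
    """Compare m^t with the comparison x^{t+1} > y^{t+1} for all operands."""
    for x in range(1, limit):
        for y in range(1, limit):
            next_x, next_y = gcd_step(x, y)
            if precomputed_m(x, y) != (next_x > next_y):
                return False
    return True


print(check(1 << (K - 2)))
```

Operands outside the assumed range would need a wider subtractor or explicit overflow handling, which the sketch does not model.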
The optimized circuit is shown in Figure 6. The original
and modified schedules for GCD are shown in Figure 5(a)
and (b), respectively. Precomputation reschedules the fine
grain operation used to compute comp to an earlier time
step. Combined with the logic in the previous time step,
a new function is synthesized to compute m. The value
of signal comp is thus precomputed one clock cycle earlier
than in the original schedule. Using SIS [14] to optimize and
retime the circuit, we computed a speed-up of 35% at an
area increase of 63%, which is the result of increases of
49% in logic area and 256% in registers. Additional performance
improvements are possible by reapplying precomputation to
precompute the signal new.
4.3 Related Work
Precomputation overlaps with several existing techniques
in high-level synthesis. The effect of using lookahead to
pipeline cycles (loops with feedback) was previously
discussed by Kogge [8]. He transforms a recurrence equation
x(n) that originally depends on the previous sequence,
x(n - 1), and an external input a(n), to a recurrence equation
that depends on an earlier recurrence: x(n - i), a(n),
and b(i), a collection of terms provided as inputs to the



















































Figure 5: Execution of GCD. (a) Initial schedule: new is computed based on x and y available in the same clock cycle. (b)
Precomputation schedule: comp is computed based on signals from the previous iteration. Note that the values of x and y
from the previous pipeline stage are effectively x and y of the previous iteration. (c) Prediction schedule. The output of the
















Figure 3: Control Data Flow Graph for GCD. The dashed
arrows are conditions. The solid arrows are dependencies.
The thick arrows at the bottom indicate crossing an iteration
boundary.
ceptually unrolls the recurrence i times to allow the cor-
responding cyclic pipeline to complete one operation each
cycle. Lookahead of iterative DSP data-flow graphs is also
used in optimizing quantizer loops to achieve the smallest
possible iteration bound [11, 10, 1].
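As a concrete illustration of lookahead, consider a first-order linear recurrence. The recurrence x(n) = c x(n-1) + a(n) and the names below are our own illustrative assumptions, not the exact formulation used in the cited work. Substituting the recurrence into itself once gives x(n) = c^2 x(n-2) + c a(n-1) + a(n), so the feedback loop spans two iterations and can hold two pipeline registers.

```python
def direct(a, c, x0):
    """x(n) = c * x(n-1) + a(n), evaluated one step at a time."""
    xs = [x0]
    for n in range(1, len(a)):
        xs.append(c * xs[n - 1] + a[n])
    return xs


def lookahead(a, c, x0):
    """x(n) = c^2 * x(n-2) + c * a(n-1) + a(n): the loop spans two steps."""
    xs = [x0, c * x0 + a[1]]      # two initial values are now required
    for n in range(2, len(a)):
        xs.append(c * c * xs[n - 2] + c * a[n - 1] + a[n])
    return xs


# Hypothetical input sequence; a[0] is unused because x(0) is given.
a = [0, 1, 2, 3, 4]
print(direct(a, 3, 1) == lookahead(a, 3, 1))
```

The extra multiply and add on the inputs sit outside the feedback loop, which is where the additional resources of a lookahead transformation go.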
Precomputation-based architectural retiming, however, is
distinct in its approach for several reasons. First, it performs
rescheduling and resynthesis at a fine grain level. Second,
it applies to any path or "piece of code", and not only to
recurrence equations. Third, precomputation manipulates
functions (operations) that have both Boolean and mathematical
identities. Thus, control and data can be concurrently
synthesized and optimized, leading to more interesting
optimizations when compared to using techniques that
specifically target each domain separately (for example, tree
height reduction and constant propagation are used specifically
to optimize data flow graphs). Fourth, when applied
to the output of a register array, precomputation discovers
a bypass transformation, an important technique whose
synthesis has not been previously addressed. Finally, by


















Figure 4: Initial implementation for GCD. Every clock cy-
cle, either x or y is updated with the value of new. When
architectural retiming is applied, the two registers inside the
dotted box are added.
ters, the definition of precomputation can be easily modified
to extend precomputation across several clock cycles.
Architectural retiming's definition of precomputation is thus
flexible and general.
5 Incremental Rescheduling With Resource Constraints
5.1 Prediction
Prediction-based architectural retiming delays a signal while
speculatively executing all its dependent operations. The
delayed signal is verified one clock cycle later. Execution
continues if the prediction is correct; otherwise, correction
















Figure 6: GCD implementation using precomputation. The
critical cycle has roughly the same delay as the original cir-
cuit, but there are now two pipeline registers along that
cycle, thus lowering the iteration bound.
path in Figure 1.
There are three issues in synthesizing prediction. First, the
circuit must be modified to return to correct normal operation
after a misprediction. A misprediction produces invalid
(incorrect) data. This invalid data may propagate in the circuit,
affecting the next iterations of the computation. In our
work, we developed two correction mechanisms: As-Soon-As-Possible
restoration and As-Late-As-Possible correction.
The first forces the circuit to return to normal operation by
restoring the circuit's state to that of the previous clock cycle
and re-executing the operations considering the correct,
yet delayed, signal. The second mechanism allows invalid
data to propagate freely in the circuit, while making each
node responsible for generating only one piece of invalid data
for every misprediction in the circuit. The implementations
of these two correction strategies generate varying and
interesting architectures.
The second issue in synthesizing prediction is changing the
circuit's I/O interface. Because two clock cycles are re-
quired to generate a new value in case of a misprediction,
prediction-based architectural retiming produces a variable
latency circuit. That is, the rate of consuming inputs and
producing outputs is no longer a constant. To make the
interface circuitry aware of this, handshaking signals that
acknowledge consuming new inputs and signal invalid out-
put data must be added to the circuit.
The final issue is deciding on a value to predict in order to
minimize the frequency of mispredictions. In our work, we
synthesize the negative register as a predicting finite state
machine that uses transition probabilities provided by the
designer to guide the predictions. We next examine how to
apply prediction to the GCD example.
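One simple way to turn designer-supplied transition probabilities into predictions is to always predict the most probable next value given the current one. The sketch below assumes that policy for illustration; the text above does not specify how the probabilities drive the state machine, and the table layout and the names `make_predictor` and `count_mispredictions` are hypothetical.

```python
def make_predictor(transition_prob):
    """Map each current value to its most probable next value.

    transition_prob[current][nxt] is the assumed designer-supplied
    probability that the signal moves from current to nxt.
    """
    return {
        current: max(nexts, key=nexts.get)
        for current, nexts in transition_prob.items()
    }


def count_mispredictions(predictor, trace):
    """Count cycles where the predicted next value differs from the real one."""
    misses = 0
    for current, actual_next in zip(trace, trace[1:]):
        if predictor[current] != actual_next:
            misses += 1
    return misses


# Hypothetical probabilities for a one-bit signal.
probabilities = {
    False: {False: 0.2, True: 0.8},
    True: {False: 0.3, True: 0.7},
}
predictor = make_predictor(probabilities)
print(count_mispredictions(predictor, [True, True, False, True]))
```

With these numbers the predictor always guesses true, which matches the fixed always-true prediction used in the GCD example.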
5.2 Prediction Example: GCD
We applied prediction-based architectural retiming to the
GCD architecture shown in Figure 4 by adding the pipeline
and negative register to signal comp. To compensate for
the added pipeline register, operations dependent on comp
are effectively rescheduled to execute speculatively. Assuming
an equal likelihood of the signal being true or false, we
arbitrarily choose to predict it always true.
The final resynthesized circuit using the As-Soon-As-Possible
correction is shown in Figure 7. A finite state machine either
generates a new prediction (m is set to true) or corrects
a previous guess (m is set to false) every clock cycle. If a
misprediction is not detected in a specific clock cycle, then
m is predicted to be true. The circuit will compute the
operation x - y in the following clock cycle. A misprediction is
detected if n is false while m was predicted true in the previous
cycle. In this case, the two added multiplexors allow the
circuit to return to correct execution in the cycle following
a misprediction, in which case the circuit computes y - x.
The circuit's interface need not be modified because signal
done is modified to never assert in the clock cycle during
which misprediction is detected. Evaluating the synthesized
circuit with SIS, we measured a speed-up of 17%, assuming a 50%
prediction accuracy, at an area increase of 92%.
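The behaviour of the predicting GCD circuit can be approximated with a cycle-counting sketch. This is a behavioural simplification under our own assumptions, not the RTL of Figure 7: comp is always predicted true, a correct prediction costs one cycle, a misprediction costs one extra cycle to restore the old values and compute y - x, and the computation is taken to be done when x equals y. The function name `gcd_with_prediction` is hypothetical.

```python
from math import gcd


def gcd_with_prediction(x, y):
    """Subtractive GCD of two positive integers with comp predicted true.

    Returns (result, cycles, mispredictions).
    """
    cycles = 0
    mispredictions = 0
    while x != y:
        old_x, old_y = x, y
        # Speculative cycle: assume x > y and compute x - y.
        speculative_x = old_x - old_y
        cycles += 1
        if old_x > old_y:
            # Verification succeeds: keep the speculative result.
            x = speculative_x
        else:
            # Misprediction: restore old values and compute y - x instead.
            mispredictions += 1
            x, y = old_x, old_y - old_x
            cycles += 1
    return x, cycles, mispredictions


result, cycles, misses = gcd_with_prediction(12, 8)
print(result == gcd(12, 8), cycles, misses)
```

In this model the cycle count equals the number of iterations plus the number of mispredictions, so any benefit depends on how often the always-true guess is right and on how much the clock period shrinks.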
The modified schedule is shown in Figure 5(c). In clock
cycle t + 2, a misprediction is detected if the value of $n^{i+1}$
is false. Signal m, the output of the FSM, is set false. The
misprediction signal is asserted and the added multiplexors
select the old values for x and y. In clock cycle t + 3, the
circuit executes the operation y - x based on previous values

























Figure 7: Using ASAP restoration correction strategy when
applying prediction-based architectural retiming to restore
the state of the circuit in case of a misprediction.
5.3 Related Work
Speculative execution can be classified according to its scope
and nesting level. The scope (breadth) can span all possible
execution paths, some of them, or it can be along a single
path. Speculating along one or two execution paths could
be reasonable both in software and in hardware; however,
speculating along multiple paths results in large hardware
implementations. The speculative execution's nesting level
(depth or extent) varies: allowing either a single outstand-
ing speculation during active execution, or multiple ones.
Speculation lasts only until a condition is resolved. Thus,
in practice, the nesting level is limited.
The use of speculative approaches in non-processor domains has
been recently investigated by the high-level synthesis (HLS)
community. Scheduling algorithms consider a variety of
scoping and nesting levels.
Wakabayashi and Tanaka propose a scheduling algorithm to
parallelize multiple nests of conditional branches; however,
they do not consider loop dependencies [15]. Holtmann
and Ernst describe four examples for which they explore
applying a technique modeled on multiple branch prediction in
a processor [6]. Their methodology is to ignore control
dependencies during scheduling and then add register sets to
restore the state in case of prediction error. ASAP restoration
associated with prediction is similar to their correction
mechanisms. However, since architectural retiming is based
on optimizing the RTL description, knowledge of exact circuit
topology and connectivity allows more flexibility. For
example, in ASAP restoration, architectural retiming does
not restore the state of register files, but rather synthesizes
bypass circuitry to delay updating register arrays instead of
restoring them in case of misprediction.
Holtmann and Ernst's technique is applied to a program
description and, thus, it predicts only explicit if-then con-
trol points in the description. In their follow-up work, they
present a detailed scheduling algorithm [7] to be used in
high-level synthesis. The scheduling algorithm performs spec-
ulation along the most probable path. Radivojevic et al. de-
scribe a scheduling technique that employs pre-execution [13].
All operations possible after a branch point are precomputed
before the branch condition is determined. Once the branch
condition is known, one of the pre-executed operations is
selected. Lakshminarayana et al. recently described how to
incorporate speculative execution in a generic schedule [9].
They allow speculative execution along multiple paths, and
arbitrarily deep into nested branches.
Prediction-based architectural retiming is different from these
high-level scheduling approaches because it could be applied
along any signal along the critical path if the misprediction
frequencies are minimal (i.e., when clever predictions are
possible). This is in contrast to related work that just
speculates based on conditionals. By examining speculation at the
structural level, we consider changes necessary to deal with
the variable latency. Our two correction strategies explore
two different methodologies for using additional multiplexors
and registers to return a circuit to correct operation after
mispredictions. Finally, prediction is not limited to one nesting
level. By inserting multiple pairs of negative/pipeline
registers we can effectively increase the nesting level; that
is, we could allow multiple outstanding mispredictions.
6 Conclusion
This paper has two main contributions. First, it demonstrates
that architectural retiming is capable of performing
fine grain incremental rescheduling of operations along critical
paths. Incremental rescheduling will become more important
as deep submicron issues become more prevalent.
Incremental rescheduling, and thus RTL optimizations, will
become a middle ground between HLS and logic synthesis.
Second, it explores in detail the relationship between archi-
tectural retiming and related work in high-level scheduling
and synthesis. Synthesis algorithms that implement archi-
tectural retiming can be found in [3].
The techniques synthesized by architectural retiming are not
new; however, architectural retiming uses the conceptual
pipelining with negative and pipeline registers to unify and
generalize a few techniques such as some forms of speculative
execution, lookahead and bypassing. The approach
proposed is refreshing and offers a novel way of performing
incremental scheduling of fine grain operations.
REFERENCES
[1] F. Catthoor, W. Geurts, and H. D. Man. "Loop Transformation
Methodology for Fixed-rate Video Image and Telecom Processing
Applications". In Proceedings of the International Conference
of Application-Specific Processors, pages 427-38, 1994.
[2] G. De Micheli. Synthesis and Optimization of Digital Circuits.
McGraw-Hill, 1994.
[3] S. Hassoun. "Architectural Retiming: A Technique for Optimizing
Latency-Constrained Circuits". PhD thesis, University
of Washington, December 1997.
[4] S. Hassoun and C. Ebeling. "Architectural Retiming: Pipelining
Latency-Constrained Circuits". In Proc. of ACM-IEEE Design
Automation Conf., pages 708-13, June 1996.
[5] S. Hassoun and C. Ebeling. "Using Precomputation in Architecture
and Logic Resynthesis". In Proc. of the 1998 International
Conference on Computer-Aided Design, 1998.
[6] U. Holtmann and R. Ernst. "Experiments with Low-Level Speculative
Computation Based on Multiple Branch Prediction".
IEEE Transactions on VLSI Systems, 1(3):262-267, September 1993.
[7] U. Holtmann and R. Ernst. "Combining MBP-Speculative Computation
and Loop Pipelining in High-Level Synthesis". In Proc.
European Design Automation Conf., pages 550-6, 1995.
[8] P. Kogge. The Architecture of Pipelined Computers.
McGraw-Hill, 1981.
[9] G. Lakshminarayana, A. Raghunathan, and N. Jha. "Incorporating
Speculative Execution into Scheduling of Control-Flow
Intensive Behavioral Descriptions". In Proc. ACM-IEEE Design
Automation Conf., pages 108-13, June 1998.
[10] K. Parhi. "Algorithm Transformation Techniques for Concurrent
Processors". Proceedings of the IEEE, 77(12):1879-95,
December 1989.
[11] K. Parhi. "Look-ahead in Dynamic Programming and Quantizer
Loops". In IEEE International Symposium on Circuits and
Systems, pages 1382-7, 1989.
[12] D. Patterson and J. Hennessy. Computer Architecture: A
Quantitative Approach. Morgan Kaufmann Publishers, 1990.
[13] I. Radivojevic and F. Brewer. "Incorporating Speculative Execution
in Exact Control-Dependent Scheduling". In Proc. 31st
ACM-IEEE Design Automation Conf., pages 479-484, 1994.
[14] E. Sentovich, K. Singh, L. Lavagno, C. Moon, R. Murgai,
A. Saldanha, H. Savoj, P. Stephan, R. Brayton, and
A. Sangiovanni-Vincentelli. "SIS: A System for Sequential Circuit Synthesis".
Technical Report UCB/ERL M92/41, University of California,
Dept. of Electrical Engineering and Computer Science, May 1992.
[15] K. Wakabayashi and H. Tanaka. "Global Scheduling Independent
of Control Dependencies Based on Condition Vectors". In Proc.
ACM-IEEE Design Automation Conf., pages 112-5, June 1992.
