Pipeline synthesis and optimization for reconfigurable custom computing machines by Weinhardt, Markus
Pipeline Synthesis and Optimization
for Reconfigurable Custom Computing Machines
Markus Weinhardt





This paper presents a pipeline synthesis and optimization technique for high-level language
programming of reconfigurable Custom Computing Machines. The circuit synthesis generates
hardware accelerators from a sequential program; these accelerators exploit the parallelism of
the reconfigurable hardware. Program loops are transformed to structural hardware specifications.
The optimization algorithm uses integer linear programming to balance and pipeline the circuit's
registers. This global optimization determines the minimal number of flip-flops necessary for
an optimal pipeline throughput. It also considers the irregular flip-flop distribution on FPGAs.
Standard interface circuitry and a runtime system provide the connection between the
accelerator unit and its host computer. An integrated compiler invokes the synthesis and produces
a program which downloads, calls and controls its hardware accelerators automatically.
1 Introduction
Reconfigurable Custom Computing Machines (CCMs) have proven useful for many applications.
They combine the flexibility of software with the speed of application-specific hardware. The
program part executed in software takes advantage of the universality of a general-purpose processor.
Yet portions executed in hardware can be accelerated enormously. However, programming CCMs
remains a difficult task since tradeoffs between software and hardware must be considered, and the
circuits accelerating the application must be designed manually. This work aims at automatically
extracting and generating accelerators from a sequential (software) program.
The following section presents the components of our high-level CCM compiler. Next, section 3
details the flowgraph synthesis which generates circuits from the program's FOR-loops, and section
4 introduces a new optimization algorithm for pipelining. Finally, we discuss related work, report
results, and draw some conclusions.
2 High-level language compilation for CCMs
This section describes the main aspects of our high-level programming approach for CCMs. We
analyze a program written in a sequential programming language (such as C or MODULA-2) and
extract hardware accelerators for it. The target architecture is a standard host computer with a
field-programmable accelerator unit comprising FPGAs and local memory.
This work has been supported by the Deutsche Forschungsgemeinschaft, Graduiertenkolleg \Beherrschbarkeit


















Figure 1: EVC1 with Pipeline Control Unit
We consider FOR-loops as hardware candidates since they express iterative computations, and
computations which perform the same operations on a large set of data are most likely to benefit
from hardware acceleration. The loops used for hardware synthesis must be in the following normal
form: The loop bodies may not contain function calls or inner loops, and the FOR-statement must
have the structure FOR I:=0 TO N DO ..., i.e. the loop counter starts with 0 and is always
incremented by one. Finally, index expressions in the loop body may not depend on variables defined in
the loop. If a loop is not given in this normal form, it can often be achieved by source language
transformations (function inlining, loop unrolling, induction variable substitution). These
transformations are state of the art [1, 2] and therefore not repeated here.¹
The pipeline synthesis and optimization algorithm (detailed in sections 3 and 4) generates
flowgraphs for FOR-loops in normal form. These flowgraphs are then instantiated with hardware
components from an operator library. Using this library, we can also estimate the resources
required by a pipeline and the speedup expected from implementing the loop in hardware. This
information is used by a subsequent partitioning algorithm which automatically determines whether a
loop should be executed in software or in hardware. This partitioning can also be performed
dynamically at run time. In that case, the actual loop length and the configuration state of the CCM
are considered, too.
To allow automatic operation, the program-specific pipelines are controlled by a pipeline control
unit (PCU). Figure 1 shows the PCU in the experimental environment we use, a Sun SPARCstation
with Virtual Computer Corp.'s EVC1 board. The PCU is used for every program and controls the
pipeline operation, i.e. it accesses and stores data, fills and flushes the pipeline and controls the
communication of the accelerator unit with the host computer. The PCU buffers input and output
values internally, so it provides more input and output ports to the pipeline than there are local
memory banks. However, due to the sequential memory access, more ports decrease the pipeline
throughput.
¹ Additionally, external operating system or library calls, pointer operations and (due to the limitations of
current FPGA technology) floating-point operations cannot be synthesized to hardware.
Another advantage of the PCU is that it makes the pipeline independent of the actual
hardware and thus facilitates porting the compiler. It also allows the software to call standard
functions for sending and receiving data to and from the local memory and for operating the
pipeline. Together with runtime-system functions which configure and reset the FPGAs, these
functions make it possible to download and call the accelerators automatically. More details on
the PCU and the hardware/software interface are presented in [3].
3 Flowgraph synthesis
This section describes how dataflow graphs (or flowgraphs for short) are synthesized from a
FOR-loop. The flowgraph will be used as a pipeline, i.e. it executes the loop's iterations in an
overlapped fashion. Therefore loop-carried dependencies (which restrict the loop's parallelism) have
to be realized as feedback cycles in the flowgraph.
3.1 Acyclic flowgraph generation
First, we apply compiler optimizations such as constant propagation and common subexpression
elimination [1] to the loop body. This reduces the pipeline size. Then we analyze the loop body as if it
were executed only once. A method similar to the Transmogrifier C compiler tmcc [4] is used. It
analyzes the dependencies of the statements and creates a purely combinational, acyclic flowgraph.²
Conditional statements (the only control construct allowed in the normal form) are implemented
by multiplexers, and array accesses are treated in the same way as scalar variables.
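The multiplexer implementation of conditionals can be illustrated with a small Python sketch. The function names and the example statement are hypothetical and not taken from the compiler; the sketch only shows that both branches are computed as operators while a multiplexer selects the result.

```python
def mux(select, when_true, when_false):
    """Model of a 2-to-1 multiplexer: the select signal picks one of two computed values."""
    return when_true if select else when_false


def abs_diff_with_branch(a, b):
    """Sequential form: IF A > B THEN Z := A - B ELSE Z := B - A END."""
    if a > b:
        z = a - b
    else:
        z = b - a
    return z


def abs_diff_as_flowgraph(a, b):
    """Flowgraph form: both subtractors always compute, the multiplexer selects one result."""
    a_minus_b = a - b
    b_minus_a = b - a
    return mux(a > b, a_minus_b, b_minus_a)
```

Both functions return the same value for every input pair, which is the property the synthesized multiplexer has to preserve.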
The flowgraph's input nodes have to be initialized with the values valid at the loop entrance, and
the values of output nodes must be read after loop execution. To enable pipeline processing, array
input and output nodes are realized as ports which provide (or process, respectively) the arrays
as sequential data streams.³ So we can treat an index-shifted version of a one-dimensional input
array (e.g. X[I-2] with loop variable I) as a delayed version of another access of the same array
(e.g. X[I]). A shift register (including the intermediate values, here X[I-1]) saves the old input
values and therefore reduces the I/O bandwidth requirements of the flowgraph. Additionally, the
direct flow of intermediate values to the next operator saves cycles for storing and loading these
values. Figure 2 shows a loop (which we will use as a running example) and its flowgraph. The
shaded nodes represent the input shift register.
TMP := 0;
FOR I := 0 TO N DO
X[I] := TMP + Y[I+1];









Figure 2: FOR-loop and its flowgraph
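The idea of serving X[I], X[I-1] and X[I-2] from a single input stream can be sketched in Python as follows. This is a behavioural model only; the function name `delayed_taps` and the `fill` value standing in for elements before the start of the stream are assumptions made for this illustration.

```python
from collections import deque


def delayed_taps(stream, depth, fill=0):
    """For each new input value x[i], yield the tuple (x[i], x[i-1], ..., x[i-depth])."""
    regs = deque([fill] * depth, maxlen=depth)  # the shift register, newest value first
    for value in stream:
        yield (value, *regs)
        regs.appendleft(value)  # shift: the newest value enters, the oldest drops out


# One read per iteration delivers all three accesses X[I], X[I-1] and X[I-2].
taps = list(delayed_taps([10, 20, 30, 40], depth=2))
```

Each array element is read from memory once, while the delayed accesses are served from the registers.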
² In contrast to tmcc, we create a flowgraph on the word level rather than the bit level.
³ Therefore, we currently only allow array accesses of the form X[I+C] and X[-I+C] (C constant). Arrays must be
read and written in the same direction to allow overlapping of the read, process and write phases. However, we can
treat multi-dimensional arrays if their higher-dimension indices do not depend on the loop variable and have always







Figure 3: Flowgraph with feedback cycle
X[-1] := 0;
FOR I := 0 TO N DO
X[I] := X[I-1] + Y[I+1];





FOR I := 0 TO N DO
X[I] := TMP + Y[I+1];
Y[I+1] := X[I] + Y[I] - Y[I] / 8;
TMP := X[I];
END
Figure 4: Loop with array dependency and its transformation
3.2 Feedback cycles
In order to guarantee correct execution of the pipeline, we have to analyze loop-carried dependencies
and accordingly introduce feedback cycles. These dependencies exist if values defined in a loop
iteration are used in a subsequent iteration. This is the case if a scalar variable is an input and an
output in the flowgraph. Then, we add a register to the output node and feed its value back to
the input node. The register holds its initial value only for the first loop iteration and stores the
feedback value on each successive iteration. Figure 3 shows the flowgraph with a feedback cycle
for register TMP (shaded). A multiplexer, along with select and clock enable logic, for choosing and
storing the correct value in the register is necessary, too, but not shown in the figure. Variables
with mutual dependencies may also lead to feedback cycles with more than one register.
A dependency for an array exists if the output node's index is the input node's index incremented
by one (if the array is traversed in increasing order) or decremented by one (if the array is traversed
in decreasing order).⁴ This is not the case in our example loop. Though array Y would prevent
parallelization, the access order of its elements is correct in a pipeline. So the flowgraph of figure
3 is correct. A flowgraph with an array feedback cycle is very inefficient because it uses only a
few values from the input port. Most of the time the values fed back from the output node are
used. This results in a large waste of I/O bandwidth. Therefore, we perform a high-level loop
transformation which introduces new scalar variables for the few initial values. Then, we can
use the method for scalar feedbacks and save input ports. Figure 4 shows a loop with an array
dependency and its transformation, which is our example loop. The transformation substitutes the
new scalar variable TMP for X[I-1].
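The effect of substituting TMP for X[I-1] can be checked with a short Python sketch of both loop forms. Several points are assumptions of this illustration: the second statement of the untransformed loop is taken to be the same as in the transformed loop, `//` stands in for the integer division `/ 8`, TMP is initialised to 0 like X[-1], and the function names are hypothetical.

```python
def loop_with_array_dependency(y, n):
    """Loop form with the loop-carried read X[I-1]; X[-1] is initialised to 0."""
    y = list(y)
    x = {-1: 0}  # a dict, so that index -1 is a real element and not "last element"
    for i in range(n + 1):  # FOR I := 0 TO N includes N
        x[i] = x[i - 1] + y[i + 1]
        y[i + 1] = x[i] + y[i] - y[i] // 8
    return [x[i] for i in range(n + 1)], y


def loop_with_scalar_feedback(y, n):
    """Transformed form: the scalar TMP carries X[I-1] into the next iteration."""
    y = list(y)
    x = [0] * (n + 1)
    tmp = 0
    for i in range(n + 1):
        x[i] = tmp + y[i + 1]
        y[i + 1] = x[i] + y[i] - y[i] // 8
        tmp = x[i]
    return x, y
```

Calling both functions with the same input, for example `y = [8, 16, 24, 32, 40]` and `n = 3`, yields identical X and Y arrays, while the second form never reads X.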
Our detailed analysis decides for all loops if they can be executed by a pipeline or not. However,
as we will see in section 4.1, large feedback cycles will result in poor pipeline throughput. This
can reduce the attainable speedup significantly.
4 Optimal pipelining
The flowgraphs generated in section 3 are not always correct. The registers inserted in the feedback
cycles delay the values on some paths from the input to the output nodes. For correct execution
the delays on all paths must be equalized by register balancing. On the other hand, paths without
feedback cycles have a very long combinational delay. They should be pipelined to increase
throughput. Therefore, we must perform register balancing and pipelining together:
Register insertion problem
Find a correct circuit which allows maximal pipeline throughput using the minimal
number of FPGA flip-flops.
We do not optimize the pipeline's latency because the time for filling and flushing the pipeline
hardly affects the overall performance. Moreover, the PCU (cf. section 2) is able to handle varying
latencies automatically.
⁴ Here we consider all the registers of an input shift register as input nodes.
4.1 Clock period computation
The attainable clock period $T_C$ can be determined before performing the pipelining itself.
Therefore, it is used as a fixed parameter of the pipelining method.
The clock period $T_C$ is determined by the number of input and output ports in most cases. On a
CCM with one memory bank it is computed by
$T_C := N_I \cdot T_I + N_O \cdot T_O$
where $N_I$ and $N_O$ are the numbers of input and output ports, respectively, and $T_I$ and $T_O$ are the
times for reading and writing a local memory word, respectively. $T_I$ and $T_O$ depend on the speed
of the used RAM and the available clock frequencies. $T_I = 50$ ns and $T_O = 100$ ns on the EVC1
board we use.
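As a worked instance of this formula, the following Python sketch evaluates $T_C$ with the EVC1 timings given above. The port counts used in the example (one input port, two output ports) are hypothetical values chosen for illustration, and the function name is not part of the compiler.

```python
def clock_period_ns(num_inputs, num_outputs, t_read_ns=50, t_write_ns=100):
    """T_C = N_I * T_I + N_O * T_O for a CCM with a single local memory bank."""
    return num_inputs * t_read_ns + num_outputs * t_write_ns


# Hypothetical pipeline with one input port and two output ports:
# 1 * 50 ns + 2 * 100 ns = 250 ns per pipeline step.
example_period = clock_period_ns(num_inputs=1, num_outputs=2)
```

Every additional port adds its full access time to the clock period, which is why more ports decrease the pipeline throughput.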
In some cases, very large feedback cycles will increase the required clock period. Inserting an
additional register in a cycle is not allowed because it would change the circuit's functionality.
However, we can reduce the clock period by optimally distributing the registers in cycles with two or
more registers.⁵ We do not consider further high-level optimizations such as those proposed in [5].
Operator delays are another possible reason for an increased clock period. But unless the operators
are part of a feedback cycle, we can always choose pipelined implementations for the operators
themselves, so they neither increase the clock period nor reduce the throughput.
4.2 ILP for optimal register insertion
Solving an integer linear program (ILP) determines integer variable values which minimize a linear
cost function according to a set of linear constraints (inequalities). The next sections give a new
formalization of the register insertion problem as an ILP. Then, we can use the simplex and
branch-and-bound algorithms to solve this global combinatorial optimization problem efficiently.
We formally consider the flowgraph $G = (N, E)$ as a set of nodes $N$ and a set of edges $E \subseteq N \times N$.
Furthermore, $I \subseteq N$ is the set of input nodes, $O \subseteq N$ the set of output nodes, $AI \subseteq I$ the set of
array input nodes, and $P \subseteq N$ the set of pseudo operators which contain no logic (e.g. constant
shifts).
4.2.1 Preprocessing
We have to preprocess the flowgraph before we can extract constraints for the ILP from it. First,
the feedback cycles are replaced by single supernodes because their registers must remain fixed.
This yields a directed acyclic graph (DAG). The node latency $NL_i$ of the supernodes is set to the
number of registers in the cycle. We define $NL_i = 0$ for purely combinational operators, and $NL_i$
equals the number of internal registers for pipelined operators.




















Figure 5: Preprocessed flowgraph
Input values
  $T_C$        clock period (in ns)
  $NL_i$       node latency of node $i$
  $EL_{i,j}$   edge latency of edge $(i, j)$
  $W_i$        output width of node $i$
  $T_{i,j}$    signal propagation time (in ns) from output of node $i$ to output of node $j$
Computed values
  $d_i$        maximal delay from register to output of node $i$ (in ns)
  $r_i$        number of registers inserted at output of node $i$
  $s_i$        number of registers saved by merging with operator $i$
  $l_i$        latency with respect to array input nodes (in clock cycles)
Table 1: Notation
Next, the shift registers which realize the delayed array inputs have to be removed because they
are subject to optimization. Instead, an edge from the array's input node to the node where the
delayed input was used is added. Its edge latency $EL_{i,j}$ is set to the required number of delays;
$EL_{i,j} = 0$ for all other edges. Figure 5 shows $G$ for our example. (Assume $NL = 0$ and $EL = 0$
unless otherwise stated.)
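How the edge latencies replace the shift register can be sketched as follows. The sketch assumes an array traversed in increasing order whose accesses have the form X[I+C], and it takes the access with the largest offset as the undelayed stream. The function name and the consumer names are hypothetical.

```python
def edge_latencies(accesses):
    """Map each consumer of one input array to its edge latency EL.

    `accesses` is a list of (consumer, offset) pairs, one per read X[I+offset].
    The read with the largest offset sees the newest stream value (EL = 0);
    every other read needs (largest offset - its own offset) delays.
    """
    newest = max(offset for _, offset in accesses)
    return {consumer: newest - offset for consumer, offset in accesses}


# Reads of Y in the example loop: Y[I+1] feeds one adder, Y[I] feeds two operators.
y_latencies = edge_latencies([("add_x", 1), ("add_y", 0), ("div8", 0)])
```

For the reads X[I] and X[I-2] mentioned in section 3.1, the same function gives latencies 0 and 2.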
4.2.2 Notation
Table 1 defines the required notation. $T_C$, $NL_i$ and $EL_{i,j}$ have been explained in the previous
sections. The operator's output width $W_i$ is used to determine the precise number of flip-flops
needed, and $T_{i,j}$ represents the combinational delay of operator $j$ with respect to its input from
node $i$. For nodes $j$ representing feedback cycles or pipelined operators, $T_{i,j}$ is the delay from
the input on edge $(i, j)$ to the first internal register. Since we cannot accurately estimate routing
delays, we add a constant average routing delay to all values $T_{i,j}$. To guarantee a working circuit,
$T_{i,j} \le T_C$ must hold for all edges $(i, j)$. Figure 5 shows these input values, too.⁶
The following computed values are all non-negative integers. $d_i$ is used as an intermediate value to
keep track of the accumulated delay of an operator chain. $r_i$ and $s_i$ count the required registers
and are used in the cost function. Finally, $l_i$ is the number of registers on a path from an array
input port to node $i$, i.e. its latency. Because $l_i$ determines the number of registers inserted on all
paths to node $i$, it guarantees a balanced pipeline.
⁶ There is only one $T$ value given for every node since the values are the same for all incoming edges.
4.2.3 Cost function
The solution of the ILP minimizes the number of inserted flip-flops. Thereby we automatically
unite register chains at different outgoing edges of a node. The flip-flop count is represented by








The first term computes the flip-flop count of all nodes by summing the products of $r_i$ (number of
registers needed at a node $i$) with the register's width $W_i$. The second sum computes the flip-flops
which can be saved by merging them with the operator's combinational logic ($0 \le s_i \le 1$). It
only applies to operators which contain logic but no internal registers. This merging is possible
in FPGA families which combine combinational logic and flip-flops in a logic block. For other
families, we simply omit the second sum.
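Read from the description of the two sums, the cost function plausibly has the following shape. This is a reconstruction under assumptions, not a quotation: in particular, the index set of the second sum (nodes outside $P$ with $NL_i = 0$) and the weighting of $s_i$ by $W_i$ are inferred from the text above. The `\substack` command requires the amsmath package.

```latex
\min \;\; \sum_{i \in N} W_i \, r_i
\;-\; \sum_{\substack{i \in N \setminus P \\ NL_i = 0}} W_i \, s_i
```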
4.2.4 Constraints
The following constraints define admissible solutions:
For all input nodes $i$, the delay from a register is 0:
$\forall i \in I : d_i = 0$ (1)
For all array input nodes $i$, the latency from array inputs is 0:
$\forall i \in AI : l_i = 0$ (2)
All input and output nodes $i$ must be registered:
$\forall i \in I \cup O : r_i > 0$ (3)
The number of registers saved by merging is limited by 1 and by the number of registers instantiated
at all:
$\forall i \in N : s_i \le 1, \; s_i \le r_i$ (4)
The accumulated delay of any node must not exceed the clock period:
$\forall i \in N : d_i \le T_C$ (5)
The DAG edges order the operator execution. Thus, for all edges $(i, j)$, the latency of node $j$ must
be at least as large as that of node $i$ plus the internal latency of node $i$:
$\forall (i, j) \in E : l_j \ge l_i + NL_i$ (6)
The number of registers at a node's output is determined by its own and its successors' latencies.
Thus, for all edges $(i, j)$, $r_i$ is larger than or equal to the difference of node latencies plus the edge
latency minus the internal latency of node $i$:
$\forall (i, j) \in E : r_i \ge l_j - l_i + EL_{i,j} - NL_i$ (7)
For edges $(i, j)$ with no register inserted ($l_i = l_j$), the accumulated delay of node $j$ is at least the
sum of that of node $i$ and the propagation time $T_{i,j}$. For registered edges, no constraint for $d_j$
applies:⁷
$\forall (i, j) \in E : d_j \ge T_{i,j} + d_i + T_C \cdot (l_i - l_j)$ (8)
The delay of node $j$ is at least the maximum of its incoming edges' propagation times:
$\forall (i, j) \in E : d_j \ge T_{i,j}$ (9)
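The constraints can be made concrete with a small feasibility checker. The sketch below does not solve the ILP; it only tests a given assignment of $d_i$, $l_i$, $r_i$ and $s_i$ against constraints (1) to (9). The dictionary layout, the function name and the three-node example graph are assumptions of this illustration, and non-negativity of the values is not checked.

```python
def violated_constraints(g, sol):
    """Return the sorted numbers of constraints (1)-(9) that `sol` violates on flowgraph `g`."""
    bad = set()
    d, l, r, s = sol["d"], sol["l"], sol["r"], sol["s"]
    for i in g["inputs"]:
        if d[i] != 0:
            bad.add(1)
    for i in g["array_inputs"]:
        if l[i] != 0:
            bad.add(2)
    for i in g["inputs"] | g["outputs"]:
        if r[i] <= 0:
            bad.add(3)
    for i in g["nodes"]:
        if s[i] > 1 or s[i] > r[i]:
            bad.add(4)
        if d[i] > g["TC"]:
            bad.add(5)
    for (i, j), t in g["T"].items():  # the keys of T are the edges of the DAG
        nl = g["NL"].get(i, 0)
        el = g["EL"].get((i, j), 0)
        if l[j] < l[i] + nl:
            bad.add(6)
        if r[i] < l[j] - l[i] + el - nl:
            bad.add(7)
        if d[j] < t + d[i] + g["TC"] * (l[i] - l[j]):
            bad.add(8)
        if d[j] < t:
            bad.add(9)
    return sorted(bad)


# Hypothetical chain: array input -> operator -> output, 60 ns per stage, T_C = 100 ns.
graph = {
    "nodes": {"in", "op", "out"},
    "inputs": {"in"}, "array_inputs": {"in"}, "outputs": {"out"},
    "TC": 100, "NL": {}, "EL": {},
    "T": {("in", "op"): 60, ("op", "out"): 60},
}
pipelined = {  # one register between "op" and "out"
    "d": {"in": 0, "op": 60, "out": 60}, "l": {"in": 0, "op": 0, "out": 1},
    "r": {"in": 1, "op": 1, "out": 1}, "s": {"in": 0, "op": 1, "out": 0},
}
unpipelined = {  # no register between "op" and "out": 120 ns of accumulated delay
    "d": {"in": 0, "op": 60, "out": 120}, "l": {"in": 0, "op": 0, "out": 0},
    "r": {"in": 1, "op": 0, "out": 1}, "s": {"in": 0, "op": 0, "out": 0},
}
```

With these values the pipelined assignment passes every check, whereas the unpipelined one fails constraint (5) because the accumulated delay exceeds the clock period.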














































Figure 6: Flowgraph with computed values and pipeline registers
4.2.5 Register insertion
After solving the ILP, we insert registers in the flowgraph in the following way: First, add $r_i$
registers to the output of every node $i$; and next, replace all edges $(i, j)$ by edges from the $n$-th
register to node $j$, where $n = l_j - l_i + EL_{i,j} - NL_i$. This automatically combines the registers in all
outgoing edges of a node. Figure 6 shows the resulting flowgraph. The values were computed using
the mixed IP solver [6]. In this example, the algorithm actually has to consider the bitwidths of
the operators to decide between inserting a register at the output or at the input of the /8-node.
Otherwise, it could not determine the minimal number of flip-flops.
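The insertion step can be sketched in Python as follows. The function takes the latencies and register counts of an already solved ILP and returns, for every edge, the index $n$ of the register it is attached to. The function name, the data layout and the example values are hypothetical.

```python
def register_taps(edges, l, r, NL=None, EL=None):
    """Return {(i, j): n} with n = l_j - l_i + EL_ij - NL_i for every edge (i, j).

    n = 0 means the edge stays connected to the node output itself. A ValueError
    is raised if an edge would need a register that the chain of r_i registers
    at node i does not contain.
    """
    NL = NL or {}
    EL = EL or {}
    taps = {}
    for (i, j) in edges:
        n = l[j] - l[i] + EL.get((i, j), 0) - NL.get(i, 0)
        if not 0 <= n <= r[i]:
            raise ValueError(f"edge {(i, j)} needs register {n}, node {i} has {r[i]}")
        taps[(i, j)] = n
    return taps


# Node "a" feeds "b" one cycle later and "c" three cycles later.
example_taps = register_taps(
    edges=[("a", "b"), ("a", "c")],
    l={"a": 0, "b": 1, "c": 3},
    r={"a": 3},
)
```

In this example both edges share a single chain of three registers at the output of "a" instead of using 1 + 3 separate registers, which is the combining effect described above.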
5 Related work
Some previous work on generating hardware accelerators from a software program has been
performed. For example, the PRISM system [7] extracts coprocessors from C functions. However, it
does not exploit hardware parallelism at a higher level as our pipeline synthesis does. On the other
hand, Guccione et al. [8, 9] use vector code to synthesize operator pipelines similar to ours. This
enables more hardware parallelism but requires the programmer to write programs in less general
vector code.
Our loop analysis is similar to methods used in vectorizing and parallelizing compilers [2]. They
allow nested loops with arbitrary strides for the array accesses, but cannot detect all loops which
could be parallelized.
Several methods have been proposed for optimal register insertion and balancing. For example,
[10] discusses pipelining algorithms for vector computers, and [11] proposes an optimal balancing
technique using ILP for data flow computers. But both methods assume a machine architecture
with fixed register width and standard delay for all operators. Neither assumption holds
for pipelines implemented on FPGAs. Therefore we have extended [11] to integrate balancing and
optimal pipelining for FPGAs.
6 Results and Conclusions
We presented new pipeline synthesis and optimization techniques for high-level CCM
programming. Experiments with a prototype MODULA-2 compiler have shown the general feasibility of
our compilation approach. Earlier measurements achieved speedups up to 14.1 for a FIR-filter
application, and up to 21.1 for a greyscale-image smoothing application [3]. The experiments
compared software runtime on a Sun SPARCstation 10 to hardware configuration and execution time
on an EVC1 board.
The example loop of section 3 shows that our dependence analysis handles a considerably larger
class of loops than parallel loops. It automatically synthesizes scan (or prefix) operators which
otherwise have to be specified explicitly, e.g. in parallel SIMD languages. Hence, we can synthesize
more efficient hardware accelerators for standard programs.
The ILP formalization introduced here computes the exact number and placement of flip-flops
necessary for optimal throughput of pipelines implemented in FPGAs. Heuristic approaches cannot
do this exactly. This optimization on the flip-flop level, rather than on the register level, is especially
necessary for coarse-grained FPGAs which have relatively few flip-flops per combinational gate. It
enables higher resource utilization in high-level CCM programming.
References
[1] A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools.
Addison-Wesley, 1986.
[2] H. Zima and B. Chapman. Supercompilers for Parallel and Vector Computers.
Addison-Wesley, 1991.
[3] M. Weinhardt. Portable pipeline synthesis for FCCMs. In Field-Programmable Logic and
Applications; 6th International Workshop, pages 1-13. Springer-Verlag, September 1996.
[4] D. Galloway. The Transmogrifier C hardware description language and compiler for FPGAs.
In P. Athanas and K. L. Pocek, editors, Proceedings of IEEE Workshop on FPGAs for Custom
Computing Machines, pages 136-144, Napa, CA, April 1995.
[5] P. Kogge. The Architecture of Pipelined Computers. McGraw-Hill, 1981.
[6] M. Berkelaar. Unix manual page of lp_solve. Eindhoven University of Technology, Design
Automation Section, 1992.
[7] P. M. Athanas and H. F. Silverman. Processor reconfiguration through instruction-set
metamorphosis. Computer, 26(3):11-18, March 1993.
[8] S. Guccione. Programming Fine-Grained Reconfigurable Architectures. PhD thesis, University
of Texas at Austin, May 1995.
[9] S. A. Guccione and M. J. Gonzalez. A data-parallel programming model for reconfigurable
architectures. In D. A. Buell and K. L. Pocek, editors, Proceedings of IEEE Workshop on
FPGAs for Custom Computing Machines, pages 79-87, Napa, CA, April 1993.
[10] K. Hwang and Z. Xu. Multipipeline networking for compound vector processing. IEEE
Transactions on Computers, 37:33-47, January 1988.
[11] G. R. Gao. Algorithmic aspects of balancing techniques for pipelined data flow code generation.
Journal of Parallel and Distributed Computing, 6:39-61, 1989.
