Improving Bounds for FPGA Logic Minimization
Tim Todman, Haofan Fu, Oskar Mencer and Wayne Luk
Department of Computing, Imperial College of Science, Technology and Medicine
Imperial College
180 Queen’s Gate, London SW7 2BZ, England
Abstract—We present a methodology for improving the bounds
of combinational designs implemented on networks of lookup
tables, moving them closer to the theoretical minimum. Our work
effectively extends optimality to span logic minimization and tech-
nology mapping. We obtain a proof of optimality by restricting
ourselves to 4-input look-up tables (LUTs) and generating all
possible circuits up to a certain area or latency depending on
the optimization mode. Since simple-minded generation would
take a long time, we develop levels of abstraction (steps) and
techniques to restrict and order the search space, and produce
results in practical time. We use logic decomposition to break up
large designs, using the resulting trees to guide our search and
prune the search space. The price of this optimality is that we
are limited to small blocks; however, such blocks can be used to
build larger designs.
I. INTRODUCTION
We address the problem of identifying minimal circuits
for a function by improving the upper and lower bounds of
resources it can use. We ﬁnd the lower bound using global
generation: in principle, generating every possible configuration
of a device. In practice, we use local symmetries to give the
effect of exhaustive generation at reduced cost, using Field-
Programmable Gate Array devices (FPGAs) for high-speed
emulation of conﬁgurations and connections of look-up tables
(LUTs). By searching the space of LUT conﬁgurations and
interconnections directly, we combine logic minimisation and
technology mapping from Boolean functions to LUTs. We ﬁnd
the upper bound using logic decomposition, applying local
generation on the components of the decomposition. If we
are lucky, global generation uncovers the minimum possible
implementation. Otherwise, we get an improved measure of
the bounds within which the optimal design must lie, as well
as a locally optimized implementation.
Our main contributions in this paper are to:
• Improve the measure of the bounds for optimal solutions
• Build a framework for circuit generation combining logic
minimization and technology mapping
• Use logic decomposition to guide search, pruning the
search space and giving the upper bound
• Evaluate our techniques on standard benchmarks
The rest of this paper is structured as follows: Section II
gives an overview of our approach, comparing it with related
work. Section III shows our circuit generation framework
combining logic minimization and technology mapping. Sec-
tion IV shows parallel hardware for generating circuits on
FPGAs. Section V uses logic decomposition to guide and
prune the search. Section VI presents
results and evaluates the use of logic decomposition in our
framework for logic minimization and technology mapping.
Finally, Section VII concludes and suggests future work.

Fig. 1. Improving bounds by generation and decomposition: (a) Process
input (logic function) and outputs (upper and lower bounds). (b) Starting
with the initial maximum max and minimum min number of LUTs, global
circuit generation increases the lower bound, while decomposition and local
generation reduce the upper. Generation is parallelizable, so multiple FPGAs
can be used for generation, allowing a higher lower bound by generating more
circuits in a reasonable time. Ultimately we find either the absolute minimum
circuit by global generation, or new, tighter bounds within which it must lie.

II. OVERVIEW AND RELATED WORK
Our approach improves the initial lower and upper bounds
of the number of LUTs required to implement a given logic
function (fig. 1), using circuit generation and logic decomposition:
global circuit generation, on the entire design, improves
the minimum; local generation, on the parts of the decomposed
design, improves the maximum.
We break generation into four steps (ﬁg. 2). We implement
step 4 in parallel hardware on FPGAs, relying on two key
FPGA properties: (a) LUTs: high-speed table look-up. (b)
Massive parallelism: many instances in parallel. We limit
ourselves to single output functions; generation for multiple
output functions is impractical.
Early works on area minimization decompose the circuit
into a set of trees, and apply technology mapping on tree
structures [1], [2]. Cong et al. concentrate on enumeration
of single output, K-input connected subgraphs (fanout-free
cones) within the circuit, and prove that the problem can still
be optimally solved by decomposing the circuit into maximal
fanout-free cones (MFFC), and enumerating separately on
each MFFC [3]. The proposed algorithm restricts the solution
to duplication-free mappings where each circuit gate must be
mapped to exactly one LUT. Later work by Cong et al. [4]
introduces heuristics to reduce the runtime, and extends the
approach to duplicable mappings.

[Figure 2 flowchart]
Input: function Y to be optimised, goal (latency or area)
Step 1: Table I: get range of latencies or areas
Step 2a (optimise for latency): find shapes for each latency; sort by latency, then by generation effort
Step 2b (optimise for area): find shapes for each area; sort by area, then by generation effort
Step 3: (A) generate each connection for each shape; (B) generate each graph for each connection
Step 4: for each graph, for all inputs, generate all LUT configurations
Output: circuit graph of optimised design
Fig. 2. Circuit generation. Step 2 differs for latency (step 2a) and area (step
2b). Step 4 can run in software or parallel hardware (section IV).

More recently, Ling et al. [5] reformulated the technology
mapping problem as a Boolean satisﬁability (SAT) problem,
showing that state-of-the-art FPGA technology mapping al-
gorithms miss optimal solutions. They also created an algo-
rithm solving the optimal area mapping problem. Safarpour et
al. [8] decompose the resulting SAT problem into two easier
problems to increase efﬁciency. Cong et al. [9] derive their
SAT formulation from the implicant rather than the minterm
representation of the problem, creating a smaller problem
which can be solved faster and cover more target problems.
Two recent efforts using enumeration concern an implicit
technique for enumerating structural choices in circuit opti-
mization based on rewiring and resubstitution [6], and the
adoption of reverse search in enumerative optimization for
obtaining, for instance, the k shortest Euclidean spanning
trees [7]. Our research complements this work, since we
exploit circuit parallelism to speed up generation.
III. FRAMEWORK
This section shows our circuit generation framework’s four-
step approach, developing expressions for the upper and lower
bound sizes of mappings from function to graph of LUTs.
Fig. 2 shows how we break the problem into four steps: Step 1:
given an N-bit input, 1-bit output boolean input function Y and
an optimization mode (area or latency), identify observable
inputs and limit the search space. Step 2: generate all circuit
shapes (vectors of the numbers of LUTs in each layer) within
the search space from step 1; sort by (a) latency or (b) area.
Step 3: generate all possible interconnections for each shape.
Step 4: generate all possible LUT configurations for each
circuit graph. We generate graphs of 4-input LUTs, with H
layers of LUTs, where layer h has $L_h$ LUTs, giving $L_{tot}$ LUTs in
total.
Logic functions with more than four inputs require multiple
LUTs. We further reﬁne the steps of ﬁg. 2 for N-input logic
functions.
TABLE I
STEP 1: THEORETICAL UPPER AND LOWER BOUNDS FOR LATENCY
(MAXIMAL DEPTH OF LUTS FROM INPUTS TO OUTPUT) AND AREA
(NUMBER OF LUTS) FOR VARIOUS NUMBERS OF INPUTS.

function      optimize for latency      optimize for area
#inputs       min          max          min                        max
≤ 4           1            1            1                          1
5             2            2            2                          3
N             $\log_4(N)$  $N - 3$      $\lfloor (N+1)/3 \rfloor$  $2^{N-3} - 1$
(order)       $O(\log N)$  $O(N)$       $O(N)$                     $O(2^N)$

Step 1 Count observable inputs, index into table I to ﬁnd
the area or latency bounds. We deﬁne latency as the maximal
depth in LUTs from design inputs to design output, and area as
the total number of LUTs. We calculate the initial upper bound
by observing that an (n+1)-input design can be implemented
using two n-input sub-designs multiplexed by an additional LUT
controlled by the (n+1)th input.
Three rules facilitate calculation of the minimum area and latency
required: (1) each observable design input must connect
to at least one LUT input, (2) at least one of the LUT inputs
must connect to a LUT output at a previous layer, (3) there is
a single LUT at the highest layer.
These rules ensure that (1) no input is redundant, (2) no
LUT is disconnected (redundant) and (3) there is only one
design output.
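
The bounds of Table I can be computed directly. The following is a minimal Python sketch, not the authors' implementation: the function names are hypothetical, and it assumes that the minimum latency $\log_4(N)$ is rounded up to the next integer, which agrees with the 5-input row of the table. `mux_area_bound` replays the multiplexer argument (two n-input sub-designs plus one selecting LUT) to show where the $2^{N-3} - 1$ area maximum comes from.

```python
def ceil_log4(n: int) -> int:
    """Smallest depth d with 4**d >= n, using integer arithmetic only."""
    depth, reach = 0, 1
    while reach < n:
        reach *= 4
        depth += 1
    return depth


def initial_bounds(n_inputs: int) -> dict:
    """Initial (min, max) bounds for latency and area, following Table I.

    Assumption: the log4(N) entry is rounded up to an integer.
    """
    if n_inputs < 1:
        raise ValueError("need at least one observable input")
    if n_inputs <= 4:
        # A single 4-input LUT implements any function of up to four inputs.
        return {"latency": (1, 1), "area": (1, 1)}
    return {
        "latency": (ceil_log4(n_inputs), n_inputs - 3),
        "area": ((n_inputs + 1) // 3, 2 ** (n_inputs - 3) - 1),
    }


def mux_area_bound(n_inputs: int) -> int:
    """Area upper bound from the multiplexer construction: A(n+1) = 2*A(n) + 1."""
    area = 1  # one LUT suffices for four inputs
    for _ in range(4, n_inputs):
        area = 2 * area + 1
    return area
```

For six observable inputs, `initial_bounds(6)["area"]` gives `(2, 7)`, the initial area range quoted for the six-input example in Section V.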
Step 2 Find all shapes for the bounds from step 1 (table II).
Sort the resulting list of shapes by latency (if optimizing for
latency, step 2a) or area (if optimizing for area, step 2b).
Within the sorted list, sort equal-latency (step 2a) or equal-area
(step 2b) shapes by generation effort: order by size of search
space in steps 3 and 4. For example, for a 7-input design for
minimum area, ﬁrst choose the smallest shape that will accept
seven inputs: (1,1) in our terminology. If this fails, choose
the next smallest shape: (2,1). Similarly, ﬁnd the minimum
latency design by iterating from the minimum latency topology
to the maximum. We observe that the number of shapes for





. Thus the total number of
shapes is bounded by $2^{L_{tot}}$ [10], and the total number of
shapes for the bounds of areas from step 1 is bounded by:
$$\sum_{L_{tot}=\lfloor (N+1)/3 \rfloor}^{2^{N-3}-1} 2^{L_{tot}} = 2^{2^{N-3}} - 2^{\lfloor (N+1)/3 \rfloor}$$
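
As an illustration of step 2, the sketch below enumerates shapes for a given number of LUTs and checks the closed form of the bound above numerically. It is a simplified model under stated assumptions: a shape is taken to be any vector of positive layer sizes ending in a single output LUT (rule 3), with no check that the layers can actually be wired together, and the function names are hypothetical.

```python
def shapes_with_area(total_luts: int) -> list:
    """All layer-size vectors that sum to total_luts and end in one output LUT.

    Assumption: fan-in feasibility between layers is not checked here.
    """
    def compositions(n):
        # Ordered ways of writing n as a sum of positive integers.
        if n == 0:
            yield ()
            return
        for first in range(1, n + 1):
            for rest in compositions(n - first):
                yield (first,) + rest

    return [layers + (1,) for layers in compositions(total_luts - 1)]


def shape_count_bound(n_inputs: int) -> int:
    """Sum of 2**L_tot over the area range given by step 1 for n_inputs."""
    lowest = (n_inputs + 1) // 3
    highest = 2 ** (n_inputs - 3) - 1
    return sum(2 ** total for total in range(lowest, highest + 1))
```

Under this model, three LUTs give the shapes (1,1,1) and (2,1), and every count stays below the $2^{L_{tot}}$ bound used in the text.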
Step 3 Generate all interconnections. Step 3(A): produce a
set of connections for each shape: topologically distinct trees
where the output of each LUT in a layer must connect to
the input of a LUT in the next layer. Step 3(B): generate
directed acyclic graphs for each connection: all combinations
of connections from each LUT input unconnected in step 3(A)
to each LUT output in previous layers, and the design inputs.
For a LUT at layer h, the number of possible interconnections
is: $L_{h-1} \cdot (N + \sum_{i=0}^{h-1} L_i)^3$.
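
The interconnection count can be evaluated for a concrete shape as follows. This is a small sketch with a hypothetical function name; it assumes layers are indexed from zero, with layer 0 nearest the design inputs, so the expression applies to layers h ≥ 1. One LUT input is tied to a LUT in the previous layer, and each of the other three inputs may take any design input or any earlier LUT output.

```python
def interconnections_per_lut(shape: tuple, n_inputs: int, h: int) -> int:
    """Number of possible interconnections for one LUT at layer h of `shape`.

    Assumption: shape[0] is the layer nearest the design inputs, and h >= 1.
    """
    if not 1 <= h < len(shape):
        raise ValueError("h must index a layer after the first")
    earlier_luts = sum(shape[:h])  # LUT outputs available from layers 0..h-1
    return shape[h - 1] * (n_inputs + earlier_luts) ** 3
```

For the output LUT of shape (2,1) with seven design inputs this evaluates to 2 · (7 + 2)^3 = 1458.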
Step 4 For all graphs, generate each conﬁguration of each
LUT, for each input. The output of the ﬁnal circuit must match
Y for each input over the input space of $2^N$. We use parallel
hardware to speed this step, shown in the next section.

TABLE II
STEP 2: ALL THE DIFFERENT SHAPES FOR ONE TO THREE 4-LUTS
[Table body not recovered; surviving header: Latency]

Fig. 3. (a) Hardware for parallel generation of shape (1,1), with I = 2 input
vectors in parallel. For each input vector, we replicate the target hardware
and emulation of LUT 1, and LUT 0 for each of p different conﬁgurations.
We use this design for both breadth- and depth-ﬁrst generation. Dotted lines
indicate hardware omitted for clarity.
IV. GENERATION CIRCUIT GENERATION
This section shows our designs for implementing step 4 of
ﬁgure 2 (generating all LUT conﬁgurations for each graph, for
all inputs) by parallel generation on reconﬁgurable hardware.
We build FPGA circuits using ASC, A Stream Com-
piler [11]. For our implementation, this means we can write
low-level optimizations and high-level structure all within the
same C++ description. We build one ASC design per shape:
Step 4 Generate an ASC circuit for each graph output from
step 3(B). Instantiate the target hardware, a datapath containing
LUT emulators and comparators, and a finite state machine.
For each configuration, the state machine loops through the inputs
until the first failing one, stopping at the first configuration that matches
the target Y output for every input. We emulate LUTs, rather
than use FPGA LUTs directly, to avoid reconﬁguring the
design for each set of LUT conﬁgurations.
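
The control flow of step 4 can be mimicked in software. The sketch below is an illustrative model, not the ASC design: it emulates each LUT as an integer truth table, fixes one hypothetical wiring for shape (1,1) (LUT 1 reads the first k inputs; LUT 0 reads LUT 1's output followed by the remaining inputs), and abandons a configuration at the first failing input. The parameter `k` is an assumption added so that the search can be exercised quickly with 2-input LUTs; the paper itself uses k = 4. Unused LUT 0 pins are simply left out of its truth table.

```python
from itertools import product


def lut(config: int, bits) -> int:
    """Emulate a LUT: `config` is the truth table, `bits` select one entry."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return (config >> index) & 1


def search_shape_1_1(target, n_inputs: int, k: int = 4):
    """Exhaustive search over LUT configurations for shape (1,1).

    Assumed wiring (hypothetical): LUT 1 reads inputs 0..k-1; LUT 0 reads
    LUT 1's output followed by inputs k..n_inputs-1.
    Returns (config1, config0) for the first match, or None if none exists.
    """
    assert k < n_inputs <= 2 * k - 1, "shape (1,1) cannot take this many inputs"
    vectors = list(product((0, 1), repeat=n_inputs))
    configs1 = 1 << (1 << k)                   # truth tables of LUT 1
    configs0 = 1 << (1 << (n_inputs - k + 1))  # truth tables of LUT 0
    for c1 in range(configs1):
        for c0 in range(configs0):
            for x in vectors:
                mid = lut(c1, x[:k])
                if lut(c0, (mid,) + x[k:]) != target(x):
                    break  # first failing input: abandon this configuration
            else:
                return c1, c0  # every input matched the target
    return None
```

With k = 2, a call such as `search_shape_1_1(lambda x: (x[0] & x[1]) ^ x[2], 3, k=2)` finds a matching pair of configurations, while the 3-input majority function returns `None` because it does not fit this wiring. With k = 4 the same loop is correct but far too slow in Python, which is the motivation for the parallel hardware.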
Fig. 3 shows the datapath of our parallel generation hardware,
which we use for both depth-first and breadth-first approaches.
The difference is in the state machine driving the datapath:
depth-first tries each input until the first failing one; breadth-
first tries only a small set of inputs. In this design, failing
means that none of the configurations of LUT 0 match the
target (output of Y) for all the parallel inputs.
Mapping to Xilinx LUTs. Part of the above design can map
explicitly to Xilinx Virtex II CLB resources – similar tech-
niques can apply to other FPGA families. Our hardware design
has two properties: (a) for p LUTs emulated in parallel, each
parallel conﬁguration for LUT 0 lies in the same arithmetic
sequence c..c + p, (b) thus the log2(p) least-signiﬁcant bits
of each conﬁguration are constant, and can be emulated with
ROMs.
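
Property (b) can be checked with a few lines of Python. The sketch assumes that p is a power of two, that the sequence start c advances in multiples of p, and that emulator j always receives configuration c + j; these details and the function names are assumptions made for illustration, not taken from the hardware description.

```python
def emulator_configs(base: int, p: int) -> list:
    """Configurations tried in one step by the p parallel LUT 0 emulators."""
    if p <= 0 or p & (p - 1):
        raise ValueError("p must be a power of two")
    if base % p:
        raise ValueError("base must be a multiple of p")
    return [base + j for j in range(p)]


def low_bits_seen(p: int, steps: int) -> list:
    """For each emulator, the set of low log2(p) bit patterns seen over all steps."""
    seen = [set() for _ in range(p)]
    for step in range(steps):
        for j, config in enumerate(emulator_configs(step * p, p)):
            seen[j].add(config & (p - 1))  # keep only the low log2(p) bits
    return seen
```

Each emulator sees exactly one pattern in its low bits, which is why those bits never need to be rewritten and can be held in ROM.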
V. LOGIC DECOMPOSITION
This section shows how we use logic decomposition to im-
prove the measure of the upper bound number of LUTs needed


























Fig. 4. Using logic decomposition: motivating example. One benchmark (a)
is decomposed into a ﬁve-input function and a NOR gate with input labelled x
(b). We show that only designs (c) and (d) need be considered, a considerably
smaller search space than for a general six-input function, and for LUT 0, we
need only generate three-input function F.
Logic decomposition takes a circuit and returns a collection
of subcircuits and their connections, which, when composed
together, give the same output as the input circuit.
To show the potential beneﬁts of logic decomposition, con-
sider a small example: output 14 of ISCAS benchmark s298.
This has six observable inputs: too large to generate on a single
CPU or FPGA. The total search space for a six-input function
is of order $O(2^{128})$, using up to seven LUTs. Figure 4 shows
the results of decomposing this design (a) into (b): a two-input
NOR gate and a ﬁve-input prime (non-decomposable) block.
After decomposition, we can reduce the search space to (c) and
(d) : a ﬁve-input function takes at most three LUTs (d), and
this design can implement the NOR in LUT 0. Three LUTs is a
signiﬁcant search-space reduction compared to seven without
decomposition. Furthermore, because part of the function of
LUT 0 is now fixed, its search space reduces to a three-input
function F (d) (search space size $2^{2^3} = 2^8$, compared to
$2^{16}$ for a 4-input function). The total search-space reduction
is thus $2^{16}/2^8 = 256$. Also, the bounds improve from 2..3
(latency) and 2..7 (area) to 2..2 (latency) and 2..3 (area).
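
The arithmetic behind the LUT 0 reduction is reproduced below as a quick check. The helper name is hypothetical; it only counts the truth tables of a single k-input LUT and says nothing about interconnections.

```python
def lut_config_space(k: int) -> int:
    """Number of distinct truth tables of a k-input LUT: 2**(2**k)."""
    return 2 ** (2 ** k)


full_lut = lut_config_space(4)    # unconstrained 4-input LUT 0
fixed_lut = lut_config_space(3)   # LUT 0 with the NOR fixed, leaving function F
reduction = full_lut // fixed_lut
```

Here `full_lut` is 65536, `fixed_lut` is 256, and `reduction` is 256, matching the factor stated above.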
Logic decomposition improves the upper bound, generating
each subsearch separately. Although the overall result is no
longer optimal, each generated subcircuit remains optimal.
Because circuit generation takes time exponential in the
number of design inputs, we choose a disjoint decomposition
method, so the decomposed functions have no common inputs;
speciﬁcally, we use Plaza and Bertacco’s STACCATO method
and software [12]. STACCATO decomposes a logic function
into a tree of subfunctions, each with disjoint inputs. Each
subfunction is either associative (AND, OR, XOR), or a prime
function – one that cannot be decomposed further.
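
A decomposition tree of this kind, and the traversal that separates out its prime blocks, might be modelled as below. This is a hedged sketch: the `Node` class, its field names and the helper functions are hypothetical and do not reflect STACCATO's actual data structures or interface. The example tree mirrors Fig. 4, with the NOR gate represented as an OR node and its output inversion ignored.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """One node of a disjoint decomposition tree (hypothetical model)."""
    kind: str  # "AND", "OR", "XOR", "PRIME" or "INPUT"
    children: List["Node"] = field(default_factory=list)
    name: str = ""


def support_size(node: Node) -> int:
    """Number of design inputs under a node; disjoint supports simply add."""
    if node.kind == "INPUT":
        return 1
    return sum(support_size(child) for child in node.children)


def prime_blocks(node: Node) -> Iterator[Node]:
    """Yield every prime (non-decomposable) block in the tree, depth-first."""
    if node.kind == "PRIME":
        yield node
    for child in node.children:
        yield from prime_blocks(child)


# Fig. 4 style example: a five-input prime block combined with input x.
prime = Node("PRIME", [Node("INPUT", name=f"i{k}") for k in range(5)])
tree = Node("OR", [prime, Node("INPUT", name="x")])
blocks = list(prime_blocks(tree))
```

In this example `blocks` holds the single five-input prime block, and `support_size(tree)` is 6, the number of observable inputs of the original function.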
Our approach divides into four steps: (1) apply logic de-
composition to design, (2) traverse the decomposition tree,
separating out the prime (non-decomposable) blocks, (3) gen-
erate each prime block, (4) build the output hardware from
the generated blocks. Step 3 applies the generation techniques
designed in the rest of this paper, using the global optimization
goal (latency or area). Note that the decomposed prime blocks
may still have too many inputs to practically generate; these
cases must rely on conventional tools for optimization.

TABLE III
RANGE IMPROVEMENT. SHOWS NUMBER OF OBSERVABLE INPUTS, MINIMAL SHAPE FOUND AND RESULTS FROM XILINX XST V8.1 AND FOR DAOMAP
AND FLOWMAP, USING THE RASP PACKAGE FROM UCLA [13].

                #Obs.                     #LUTs                     Area bounds       Latency bounds
Name  Output  Inputs  #Shapes  Shape   XST  DAOmap  FlowMap   (old)   (imp.)    (old)   (imp.)
s27   1       5       2        (1,1)   2    5       5         2..3    2..2      2..2    2..2
      2       5       2        (1,1)   2    2       2         2..3    2..2      2..2    2..2
      3       5       2        (1,1)   2    5       5         2..3    2..2      2..2    2..2
s298  8       5       2        (1,1)   2    3       3         2..3    2..2      2..2    2..2
      10      5       2        (1,1)   2    3       3         2..3    2..2      2..2    2..2
      12      7       268      (2,1)   3    5       5         2..15   3..3      2..2    3..3
      14      6       18       (2,1)   3    3       3         2..7    3..3      2..2    3..3
b01   4       5       2        (1,1)   2    3       3         2..3    2..2      2..2    2..2
      5       5       2        (1,1)   2    3       3         2..3    2..2      2..2    2..2
      6       5       2        (1,1)   2    3       3         2..3    2..2      2..2    2..2
      7       5       2        (1,1)   2    2       3         2..3    2..2      2..2    2..2

VI. RESULTS AND EVALUATION
This section shows results for software and hardware gen-
eration for several ISCAS benchmarks, showing original and
improved bounds achieved.
Table III shows benchmarks chosen from the standard
ISCAS 85, 89 and 99 sets, with bounds of LUTs for area
and latencies – these worst-case results are the initial upper
and lower bounds from table I. The XST, DAOmap [15] and
FlowMap [14] results are for each output individually – we
remove hardware for other outputs. The bounds improvement
results for these benchmarks show runtime and minimal shapes
found (software results run on an Intel Xeon 2 GHz processor).
Generation times vary up to an order of magnitude. All the
software generation results correspond to a generation rate of
roughly $4.8 \times 10^6$ configurations per second, about 20% of the
rate of our hardware. Hardware generation runs on a single
Xilinx XC2V6000 FPGA (Celoxica RC2000 board).
VII. CONCLUSION
We show a methodology for optimising circuits for FPGA
implementation that combines logic minimization and technol-
ogy mapping. We develop a four-step process to give the effect
of generating all possible circuits ordered by user optimization
goal: latency or area. Our reconﬁgurable hardware implemen-
tation speeds this process by rapidly ﬁnding which generated
circuits match the target design. We use logic decomposition
to guide and speed our search process, eliminating searches
using the resulting decomposition tree. Although our approach
is only globally optimal for small designs, it is still locally
optimal for larger designs, and can be applied to building
blocks of larger designs.
Current and future work includes porting generation to a
large multiple-FPGA machine. This is ideal for generation as
many generators can run in parallel across multiple FPGAs.
We would also like to extend generation to cover multiple-
output designs, sequential designs and other design elements
beyond LUTs. Our ultimate goal is to subsume many tradi-
tionally separate optimization steps into one generation step,
with results guaranteed to be optimal.
REFERENCES
[1] K. Keutzer. “DAGON: Technology Binding and Local Optimization by
DAG Matching”. In Proc. DAC 1987, pages 341–347, 1987.
[2] R. Francis, J. Rose, and Z. Vranesic. “Chortle-crf: Fast Technology
Mapping for Lookup Table-Based FPGAs”. In Proc. DAC 1991, pages
227–233, 1991.
[3] J. Cong, and Y. Ding. “On area/depth trade-off in LUT-based FPGA
technology mapping”. In Proc. of DAC 1993, pages 213–218, 1993.
[4] J. Cong, C. Wu, and Y. Ding. “Cut ranking and pruning: enabling a
general and efﬁcient FPGA mapping solution”. In Proc. FPGA 1999,
pages 29–35, 1999.
[5] A. Ling, D.P. Singh and S.P. Brown, “FPGA Technology Mapping: A
Study of Optimality”, In Proceedings of DAC 2005, IEEE, 2005.
[6] V.N. Kravets and P. Kudva, “Implicit Enumeration of Structural Changes
in Circuit Optimization”, Proc. DAC, pp. 438–441, 2004.
[7] J. Nievergelt, “Exhaustive Search, Combinatorial Optimization and Enu-
meration: Exploring the Potential of Raw Computing Power”, Proc. Conf.
on Current Trends in Theory and Practice of Informatics, LNCS 1963,
pp. 18–35, 2000.
[8] S. Safarpour, A. Veneris, G. Baeckler and R. Yuan, “Efﬁcient SAT-based
Boolean Matching for FPGA Technology Mapping”, in Proc. Design
Automation Conference, IEEE, 2006.
[9] J. Cong and K. Minkovich, “Improved SAT-based Boolean Matching
Using Implicants for LUT-based FPGAs”, in FPGA ’07, ACM, 2007.
[10] Ronald L Graham, Donald E Knuth, Oren Patashnik, Concrete Mathe-
matics: A Foundation for Computer Science, Addison-Wesley, 1989.
[11] Oskar Mencer, “ASC, A Stream Compiler for Computing with FPGAs”
IEEE Transactions on CAD, IEEE, 2006.
[12] S. Plaza and V. Bertacco. “STACCATO: Disjoint Support Decomposi-
tions from BDDs through Symbolic Kernels”. In Proceedings Asia South
Paciﬁc Design Conference, 2005.
[13] UCLA VLSI CAD lab., RASP – LUT-Based FPGA Technology Map-
ping Package, release B1.1, at http://cadlab.cs.ucla.edu/
software release/rasp/htdocs/ .
[14] J. Cong and Y. Ding, “FlowMap: An Optimal Technology Mapping Al-
gorithm for Delay Optimization in Lookup-table Based FPGA Designs”,
in IEEE Trans. on CAD of ICs and Systems 13:1, IEEE, Jan. 1994.
[15] D. Chen, and J. Cong,“DAOmap : A Depth-Optimal Area Optimization
Mapping Algorithm for FPGA Designs”, in Proc. ICCAD, pp. 752-759,
Nov. 2004.