Beyond the arithmetic constraint: depth-optimal mapping of logic chains in LUT-based FPGAs by Frederick, Michael & Somani, Arun K.
2-2008 
Beyond the arithmetic constraint: depth-optimal mapping of logic 
chains in LUT-based FPGAs 
Michael Frederick 
Iowa State University 
Arun K. Somani 
Iowa State University, arun@iastate.edu 
Recommended Citation 
Frederick, Michael and Somani, Arun K., "Beyond the arithmetic constraint: depth-optimal mapping of 
logic chains in LUT-based FPGAs" (2008). Electrical and Computer Engineering Conference Papers, 
Posters and Presentations. 120. 
https://lib.dr.iastate.edu/ece_conf/120 
Abstract 
Look-up table based FPGAs have migrated from a niche technology for design prototyping to a valuable 
end-product component and, in some cases, a replacement for general purpose processors and ASICs 
alike. One way architects have bridged the performance gap between FPGAs and ASICs is through the 
inclusion of specialized components such as multipliers, RAM modules, and microcontrollers. Another 
dedicated structure that has become standard in reconfigurable fabrics is the arithmetic carry chain. 
Currently, it is only used to map arithmetic operations as identified by HDL macros. For non-arithmetic 
operations, it is an idle but potentially powerful resource. 
This work presents ChainMap, a polynomial-time delay-optimal technology mapping algorithm for the 
creation of generic logic chains in LUT-based FPGAs. ChainMap requires no HDL macros be preserved 
through the design flow. It creates logic chains, both arithmetic and non-arithmetic, in an arbitrary 
Boolean network whenever depth increasing nodes are encountered. Use of the chain is not reserved for 
arithmetic, but rather any set of gates exhibiting similar characteristics. By using the carry chain as a 
generic, near zero-delay adjacent cell interconnection structure an average optimal speedup of 1.4x is 
revealed, and an average relaxed speedup of 1.25x can be realized simultaneously with a 0.95x LUT 
utilization decrease. 
Keywords 
Carry Chain, Logic Chain, Depth Optimal Mapping 
Disciplines 
Computer and Systems Architecture | Systems and Communications 
Comments 
This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not 
for redistribution. The definitive version was published in Frederick, Michael T., and Arun K. Somani. 
"Beyond the arithmetic constraint: depth-optimal mapping of logic chains in LUT-based FPGAs." 
Proceedings of the 16th international ACM/SIGDA symposium on Field Programmable Gate Arrays 
(2008): 37-46. DOI: 10.1145/1344671.1344679. 
Beyond the Arithmetic Constraint: Depth-Optimal Mapping
of Logic Chains in LUT-based FPGAs
Michael T. Frederick and Arun K. Somani ∗
Iowa State University
Department of Electrical and Computer Engineering
Ames, IA 50011, USA
michael.t.frederick@gmail.com, arun@iastate.edu
∗The research reported within is partially supported by NSF
grant no. CCF0311061, the ISU Jerry R. Junkins Endow-
ment, and the Dept. of Education GAANN Fellowship.
Figure 1: (a) (K-1) carry-select chain, (b) {K−1,K}
heterogeneous logic chain
1. CHAINS IN FPGAS
Look-up table (LUT) based Field Programmable Gate Ar-
rays (FPGAs) have traditionally been relegated to the realm
of prototyping because they lacked the performance neces-
sary to be critical pieces of a production design. However,
advances in codesign, process technology, and innovative ar-
chitectures have narrowed the performance gap between FP-
GAs and ASICs to the point where their flexibility and rel-
atively low cost have made them justifiable design choices.
Modern FPGAs have embedded dedicated components such
as multipliers, RAM modules, and microcontrollers along-
side reconfigurable logic in an effort to provide the special-
ized resources to achieve the necessary performance. One
important dedicated structure present in nearly all commer-
cially available architectures is the arithmetic carry chain.
Design depth is created by logic and programmable rout-
ing connections. The routing array provides flexible inter-
connection between any LUTs. Fig. 1(a) presents an arith-
metic chain consisting of logic elements (LEs). Each LE can
act as two (K-1)-LUTs or one K-LUT. The primary inputs
(PIs) have a routing depth g = 0, and for each LE g = 1,
because the path to any LE in the chain traverses only 1
routing connection and increases only logic depth. The first
member of each chain has a logic depth of l = 1, while the
last has l = 4. Carry chains provide near-zero delay trans-
mission of a carry, but are invoked only through hardware
description language (HDL) macros. Each chain node is a
depth increasing node, one that increases logic depth without
increasing routing depth. However, for designs that contain
few arithmetic operations and incorporate a carry-select
style architecture, the carry chain is an idle resource.
Figure 2: (a) (K-1)-LUT mode, (b) K-LUT mode
FlowMap [2] and its derivative algorithms solve optimal
logic depth mapping in polynomial time. However, they
view FPGAs sans the industry standard carry chain. The
optimal use of logic chains requires the ability to identify
a depth increasing node and implement it in a chain net,
thereby minimizing routing depth. Due to the increasing
inclusion of FPGAs in deployed systems, the need to fully
utilize every architectural resource is imperative, so as to
close the performance gap between FPGAs and ASICs.
FPGAs typically use ripple-carry schemes, or variations
thereof, for area efficient arithmetic. The Altera Stratix and
Cyclone architectures [1] use a carry-select chain, character-
ized in Fig. 2(a). An LE operating in (K-1) mode contains
two (K-1)-LUTs, one driving a chain net through the cout
port, and the other driving the general routing array through
gr. These LEs facilitate chains as in Fig. 1(a). The Stratix
also incorporates an LUT chain, characterized in Fig. 2(b),
wherein one K-LUT simultaneously drives the same logic
function to the chain net and general routing. The Stratix
LUT chain uses a connection between LEs, separate from
the carry chain, to achieve K-LUT mode and form hetero-
geneous chains as in Fig. 1(b). A modified carry-select ar-
chitecture presented in [5] operates in either mode depicted
in Fig. 2, and forms heterogeneous chains as in Fig. 1(b). It
does so without the additional wire required by the Stratix
LUT chain, instead reusing the existing carry chain. The
presented algorithm assumes an LE that can operate in ei-
ther mode in Fig. 2, i.e. the Stratix or [5], and is not cur-
rently suitable for Xilinx devices or the Stratix II/III.
Architectures supporting logic chains are available, but
they are useless unless a CAD tool can efficiently implement
them. Current software packages identify arithmetic carry
chains through high-level HDL macros and primitives. The
LUT chain is mapped by Quartus II during place and route
(PNR) according to undisclosed metrics. The only recourse
for a designer wanting logic chains is to create them with
low level primitives or hand modify the design. The most
common academic synthesis tool, SIS [8], does not support
arithmetic chains in its internal representation.
There are many variations on technology mapping in the
literature. Solutions are designed to optimize delay, area,
routing congestion, power consumption, or any combination
thereof. Unfortunately, the simultaneous optimal solution of
multiple performance metrics has been proven NP-complete
[3]. FlowMap [2] is the first solution to map a design to a
K-LUT architecture with optimal logic depth in polynomial
time. It uses the network flow Max-flow Min-cut algorithm
[4] to enumerate K-feasible cuts in a network.
Logic depth minimization assumes that the nets connecting
LUTs are routing nets. Quartus II estimates that, across its
entire Stratix family, the variable routing delay is typically
between 300 ps and 2 ns, while a chain net contributes 0 ps of
wire delay. By comparison, a Stratix 4-LUT is estimated at
366 ps and carry chain logic at 58 ps. In most circuits, 70%
of the delay is due to routing traversals, and most of the
remainder is due to LUTs. Almost none of the delay is due to
the carry chain logic or interconnection.
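To make these figures concrete, the following sketch compares one path implemented entirely through general routing against the same path implemented as a chain. Only the per-element delays are taken from the text; the four-LUT path length and the composition model (one routing connection into the first LUT, then chain logic plus chain wire per later hop) are illustrative assumptions, not a timing model from Quartus II.

```python
# Delay figures quoted in the text, in picoseconds.
LUT_PS = 366            # Stratix 4-LUT
CHAIN_LOGIC_PS = 58     # carry chain logic
CHAIN_WIRE_PS = 0       # chain net wire delay
ROUTE_MIN_PS, ROUTE_MAX_PS = 300, 2000  # variable routing delay range


def routed_path_ps(num_luts, route_ps):
    """Assumed model: every LUT is reached through one routing connection."""
    return num_luts * (route_ps + LUT_PS)


def chained_path_ps(num_luts, route_ps):
    """Assumed model: routing reaches the first LUT only; each later hop
    pays chain logic and chain wire delay instead of a routing traversal."""
    hops = num_luts - 1
    return route_ps + num_luts * LUT_PS + hops * (CHAIN_LOGIC_PS + CHAIN_WIRE_PS)


for route in (ROUTE_MIN_PS, ROUTE_MAX_PS):
    print(route, routed_path_ps(4, route), chained_path_ps(4, route))
```

Under these assumptions the gap widens as routing delay grows, which is the reason the mapping objective shifts from logic depth to routing depth.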
Clearly, mapping should address routing depth rather than
logic depth. Chains are an underutilized, low latency re-
source waiting to be exploited. This work presents a poly-
nomial time, depth optimal logic chain technology mapping
solution, applicable to the Stratix K-LUT chain and the K-
width chain reuse cell presented in [5], and easily adapted
to standard (K-1) carry-select chains. The motivation is to
create generic logic chains not limited to arithmetic opera-
tions. Through optimal use of near zero-delay carry nets,
designs achieve greater performance. An extension to this is
the ability to disregard HDL macros and free the design flow
to work on an entire design, unfettered, toward any goal.
Section 2 formulates the problem and proves optimality.
The algorithm labels nodes and generates a set of minimum
height K-feasible cuts, maps those nodes according to cuts,
and duplicates nodes to adhere to characteristics unique to
logic chains. Performance is judged in Section 3 using three
different methods of handling HDL-defined arithmetic carry
chains. Section 4 concludes with discussion of results, po-
tential impact, and future work.
2. DEPTH OPTIMAL CHAIN MAPPING
The optimal routing depth technology map solution de-
scribed by ChainMap is partially based on the optimal logic
depth FlowMap [2], and is formulated similarly for ease of
comparison. SIS [8] nomenclature is used to describe an ar-
bitrary Boolean network. Such a network can be represented
as a directed acyclic graph (DAG) N = (V,E) with vertices
V and edges E, where n = |V | and m = |E|. Each Boolean
gate in the network is represented as a node, and edge(u, v)
connects nodes u, v ∈ V if there exists a net from the out-
put of gate u to an input of gate v. Notation is abused such
that u ∈ N implies that u ∈ V and edge(u, v) ∈ N implies
edge(u, v) ∈ E for N = (V,E). A predecessor is defined as a
node u such that there exists a directed path from u to v for
u, v ∈ N . Likewise, a descendant is a node v such that there
exists a directed path from u to v for u, v ∈ N . PIs have no
incoming edges and POs have none outgoing. The following
definitions will be used in the description of ChainMap:
• u, v, w, x are general nodes in a graph
• PI(N) and PO(N) refer to the set of primary inputs
or outputs of N , respectively
• i, j are scalar indices used with nodes
• s is an auxiliary global source node, s.t. ∀v ∈ PI(N),
edge(s, v) is added
• t denotes a sink node, and Nt is a subgraph of N con-
taining node t and its predecessor nodes and edges
• s denotes a source node, and Ns is a subgraph of N
containing node s and its descendant nodes and edges
• d is a depth increasing node
• g(v) is the routing label and l(v) the logic label for v
• p is a scalar s.t. p = max{g(u) : u ∈ N}
• q is a scalar s.t. q = max{l(u) : u ∈ N}
• P ⊆ Nt s.t. v ∈ P if g(v) = p,∀v ∈ Nt
• Pd ⊆ P consisting of d and its predecessors in P
• N′t is a DAG with a valid depth increasing node
• N′′t is derived from N′t to apply Max-flow Min-cut
• d′ ∈ N′t is formed by collapsing the nodes in Pd into d
• t′ ∈ N′t is formed by collapsing the nodes in P̄d into t
• (X, X̄), (Y, Ȳ), (Z, Z̄) denote node cuts in a network,
e.g. nodes are partitioned so that s ∈ X and t ∈ X̄
• input(H) for a set H ⊆ N, is the set {u : u ∉ H,
∃edge(u, v), v ∈ H}, and is also abused for nodes
• output(H) for a set H ⊆ N, is the set {u : u ∈ H,
∃edge(u, v), v ∉ H}, and is also abused for nodes
• cap(u, v) denotes the flow capacity of edge(u, v)
• LUT (t) is the set of nodes in the K-LUT of t
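As a rough illustration of the notation above, the sketch below stores a Boolean network as a DAG and implements PI(N), PO(N), the cone Nt, and input(H). The class name `Network` and its method names are hypothetical helpers introduced here for illustration; they are not part of SIS or of ChainMap.

```python
from collections import defaultdict


class Network:
    """Minimal DAG: nodes are hashable names, edges are (u, v) pairs."""

    def __init__(self, edges):
        self.succ = defaultdict(set)
        self.pred = defaultdict(set)
        self.nodes = set()
        for u, v in edges:
            self.succ[u].add(v)
            self.pred[v].add(u)
            self.nodes.update((u, v))

    def primary_inputs(self):
        """PI(N): nodes with no incoming edges."""
        return {v for v in self.nodes if not self.pred[v]}

    def primary_outputs(self):
        """PO(N): nodes with no outgoing edges."""
        return {v for v in self.nodes if not self.succ[v]}

    def cone(self, t):
        """Nt: node t together with all of its predecessors."""
        seen, stack = set(), [t]
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(self.pred[v])
        return seen

    def input_of(self, H):
        """input(H): nodes outside H that drive some node inside H."""
        H = set(H)
        return {u for v in H for u in self.pred[v] if u not in H}


net = Network([("a", "c"), ("b", "c"), ("c", "e"), ("d", "e")])
print(sorted(net.input_of({"c", "e"})))
```

A set H is K-feasible in this representation when `len(net.input_of(H)) <= K`.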
Through abuse of notation, a node or set denoted as “prime”
indicates to which network it belongs. For example, (X′, X̄′)
is a cut belonging to network N′t. A K-feasible cone Nv is
a subgraph of N containing v and each of its predecessors
such that |input(Nv)| ≤ K. The goal is to cover K-bounded
N, where ∀v ∈ V, |input(v)| ≤ K, with K-feasible cones for
implementation in a K-LUT FPGA.
The level of t is the length of the longest path from any PI
predecessor {u : u ∈ PI(Nt), u ≠ t} to t, with PIs possessing a level
of 0. The distinction that ChainMap makes from FlowMap
is that level is in terms of the maximum number of routing
connections traversed from PI(Nt) to t. Chain connections
do not count as a routing level increase, therefore, if the
longest path between a PI and node t traverses g routing
connections and c chain connections, level(t) = g. The depth
of the network is the maximum level of all its vertices.
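The level definition can be sketched as a longest-path computation in which chain connections contribute 0 and routing connections contribute 1. The function name `routing_levels` and the representation of chain nets as a set of edges are assumptions made for this example.

```python
from graphlib import TopologicalSorter


def routing_levels(edges, chain_edges=frozenset()):
    """Longest path from any PI, counting routing connections only."""
    preds = {}
    for u, v in edges:
        preds.setdefault(u, set())
        preds.setdefault(v, set()).add(u)
    level = {}
    # static_order() yields every node after all of its predecessors.
    for v in TopologicalSorter(preds).static_order():
        level[v] = max(
            (level[u] + (0 if (u, v) in chain_edges else 1) for u in preds[v]),
            default=0,  # PIs have level 0
        )
    return level


edges = [("pi", "a"), ("a", "b"), ("b", "c")]
print(routing_levels(edges))
print(routing_levels(edges, {("a", "b"), ("b", "c")}))
```

With both internal nets treated as chain nets, every node after the PI sits at level 1, mirroring the g = 1 labels described for Fig. 1(a); the network depth is the maximum value in the returned dictionary.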
As in FlowMap, the concept of a network cut, (X, X̄),
is pivotal. The node cut size, given by Eqn. 1, quantifies
the size of input(X̄), i.e. the number of nodes that have a
forward edge crossing the cut. To find the K-feasible node
cut, the edge cut size will be employed, according to Eqn. 2.
For the remainder of the algorithm discussion a unit delay
model is incorporated, meaning that cap(u, v) = 1, ∀u, v ∈
V. The logic height of the cut is the maximum logic label
in X, as in Eqn. 3. The routing height of the cut is the
maximum routing label in X, as in Eqn. 4.
$$n(X, \overline{X}) = \left|\{u : u \in X,\ \exists\, edge(u, v),\ v \in \overline{X}\}\right| \qquad (1)$$
$$e(X, \overline{X}) = \sum_{\substack{edge(u, v) \in N \\ u \in X,\ v \in \overline{X}}} cap(u, v) \qquad (2)$$
$$h_L(X, \overline{X}) = \max\{l(u) : u \in X\} \qquad (3)$$
$$h_G(X, \overline{X}) = \max\{g(u) : u \in X\} \qquad (4)$$
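The cut measures can be written down directly. The sketch below assumes a cut is given as two explicit node sets (the source side and the sink side holding t) and that labels are stored in dictionaries; the function names and the small example network are hypothetical.

```python
def node_cut_size(edges, src_side, sink_side):
    """Number of source-side nodes with a forward edge crossing the cut."""
    return len({u for u, v in edges if u in src_side and v in sink_side})


def edge_cut_size(edges, src_side, sink_side, cap=lambda u, v: 1):
    """Total capacity of forward edges crossing the cut (unit delay model)."""
    return sum(cap(u, v) for u, v in edges if u in src_side and v in sink_side)


def cut_height(label, src_side):
    """Maximum label (logic or routing) among the source-side nodes."""
    return max(label[u] for u in src_side)


edges = [("s", "a"), ("s", "b"), ("a", "c"), ("b", "c"), ("a", "t"), ("c", "t")]
g = {"s": 0, "a": 0, "b": 0, "c": 1}  # assumed routing labels

print(node_cut_size(edges, {"s", "a", "b"}, {"c", "t"}))
print(edge_cut_size(edges, {"s", "a", "b"}, {"c", "t"}))
print(cut_height(g, {"s", "a", "b", "c"}))
```

The first cut shows why the two sizes differ: node a drives two crossing edges, so three edges cross while only two nodes feed the sink side.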
The primary objective is to minimize the network routing
delay by minimizing hG(X, X̄) for all nodes. Using a binary
depth model, each routing net increases routing depth by
1, while a chain net does not increase it. The secondary
objective is to minimize the logic delay of the network by
minimizing hL(X, X̄) for all nodes such that hG(X, X̄) is
minimum, because network delay is also defined by the delay
through its K-LUTs. A third objective is to minimize
the area of the design in terms of the number of K-LUTs
required by the solution. A solution is optimal if the net-
work routing depth is minimum and the logic depth, within
the confines of minimum routing depth, is also minimum.
ChainMap consists of three phases: labeling, mapping,
and duplication, with an optional fourth, relaxation. In the
labeling phase, ChainMap identifies whether or not a DAG
can be constructed that consists of a given node t and its pre-
decessors, and contains a depth increasing node d. If such
a DAG is possible, two subsequent graph transformations
are applied that isolate d in N′t and convert the network to
N′′t, one to which Max-flow Min-cut can be applied. If a
K-feasible cut can be found, then t does not increase the
routing depth of the design. If t = d, this is akin to the min-
imum height logic cut identified by FlowMap, and contains
all other possible cuts. The second phase of ChainMap is
identical to that of FlowMap, wherein the K-feasible cuts
computed during labeling are used to form K-LUTs. The
third phase of ChainMap is to duplicate nodes that source
multiple chain nets to adhere to the special constraints im-
posed by chains. An optional relaxation phase can be ap-
plied to restrict the number of duplications required.
2.1 ChainMap Labeling
ChainMap correlates g(v) to the general routing depth of
node v. This is a subtle change in definition from FlowMap,
which uses l(v) to indicate both logic and routing depth be-
cause it considers all nets to be routing connections. The
introduction of the logic chain provides for a net with prop-
erties different from general routing. A chain net allows any
u ∈ input(v) to cause l(v) = l(u) + 1 while allowing for the
possibility that g(v) = g(u).
The labeling phase is performed on a topological ordering
of the nodes in N, ensuring that node u ∈ input(v) is
processed before v. N is K-bounded, meaning |input(u)| ≤
K, ∀u ∈ N. Each u ∈ PI(N) has g(u) = l(u) = 0. Fig. 3(a)
shows an example Nt where all edges traversing to u ∉ Nt
have been pared away, and the auxiliary source s added.
If LUT(t) denotes the set of nodes in the K-LUT which
implements t, then X̄ = LUT(t) and X = Nt − LUT(t).
Given X and X̄, a K-feasible cut (X, X̄) is formed such that
s ∈ X and t ∈ X̄ and n(X, X̄) ≤ K. A depth increasing node
is one which is solely responsible for increasing the routing
depth of LUT(t).
Definition 2.1. Let node d ∈ input(X̄) be a node with
maximum label g(d) = p. If g(d) > g(v), ∀v ∈ input(X̄), v ≠
d, then d is depth increasing.
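Definition 2.1 amounts to asking whether the largest routing label among the inputs of the LUT side of the cut is attained by exactly one node. A minimal sketch, assuming the inputs are given as a set and the routing labels as a dictionary (the function name is hypothetical):

```python
def depth_increasing_node(cut_inputs, g):
    """Return the input whose routing label strictly exceeds all others,
    or None when the maximum label is shared by several inputs."""
    ranked = sorted(cut_inputs, key=lambda v: g[v], reverse=True)
    if not ranked:
        return None
    if len(ranked) == 1 or g[ranked[0]] > g[ranked[1]]:
        return ranked[0]
    return None


print(depth_increasing_node({"a", "b", "d"}, {"a": 1, "b": 1, "d": 2}))
print(depth_increasing_node({"a", "b", "d"}, {"a": 1, "b": 2, "d": 2}))
```

In the second call two inputs share the maximum label, so neither is solely responsible for the routing depth and no depth increasing node exists for that cut.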
Let u ∈ X be a node with p = g(u) and d be a depth
increasing node, then the routing label of t is g(t) = p if
d ∈ X and g(t) = g(u) + 1 otherwise. Eqn. 4 indicates that
to minimize the hG(X, X̄) of LUT(t), the minimum height
K-feasible cut (X, X̄) must be found in Nt.
Lemma 2.2. The routing depth of Nt is given by:
$$g(t) = h_G(X, \overline{X}) + \begin{cases} 0 & \text{if } d \in X \\ 1 & \text{otherwise} \end{cases}$$
Let v ∈ X be the node with maximum logic label q = l(v), then
l(t) = l(v) + 1. The logic label of t is dependent on the
K-feasible minimum height routing cut (X, X̄). Because the
nodes in X and X̄ represent nodes in different LUTs, logic
depth simply increases at each routing cut.
Lemma 2.3. The logic depth of Nt is given by:
$$l(t) = h_L(X, \overline{X}) + 1$$
Figure 3: (a) Nt, (b) N′t for chain cut.
Furthermore, for any t, g(t) ≥ g(u) and l(t) ≥ l(u),
∀u ∈ input(t). This is important because the value g(t)
has two possibilities: if a minimum height cut can be found
at hG(X, X̄) = p − 1, or at hG(X, X̄) = p with d ∈ X, then
g(t) = p; otherwise g(t) = p + 1. Likewise, the logic label of t
follows a similar derivation and its proof is identical to that
presented by Lemma 2 in FlowMap [2]. For purposes of dis-
cussion, this proof is excerpted as Lemma 2.5. Lemmas 2.4
and 2.5 ensure that the routing and logic labels of each node
are greater than or equal to any of their predecessors.
Lemma 2.4. If p is the maximum routing label of the nodes
in input(t), then g(t) = p or g(t) = p + 1.
Proof. If u ∈ input(t), then any cut (X, X̄) in Nt results
in either u ∈ X or u ∈ X̄.
When u ∈ X, Eqn. 4 requires that hG(X, X̄) ≥ g(u) and
by Lemma 2.2 g(t) ≥ hG(X, X̄), therefore, g(t) ≥ g(u).
When u ∈ X̄, the K-feasible cut (X, X̄) defines a K-feasible
cut (Y, Ȳ) in Nu, where Y = X ∩ Nu and Ȳ = X̄ ∩ Nu.
Let (Z, Z̄) be the minimum height K-feasible cut
computed for Nu. Since (Z, Z̄) is the minimum height cut,
then h(Y, Ȳ) ≥ h(Z, Z̄) because Z ⊆ Y. Likewise, since Y ⊆
X, h(X, X̄) ≥ h(Y, Ȳ), therefore, h(X, X̄) ≥ h(Z, Z̄). There
are two possible values for both g(t) and g(u) according to
Lemma 2.2, resulting in four possible cases. Fig. 4(a) applies
to i and ii, while (b) applies to iii and iv.
(i) If g(t) = h(X, X̄) + 1, g(u) = h(Z, Z̄), then g(t) >
h(X, X̄) ≥ h(Z, Z̄) = g(u), thus g(t) > g(u).
(ii) If g(t) = h(X, X̄) + 1, g(u) = h(Z, Z̄) + 1, then g(t) −
1 = h(X, X̄) ≥ h(Z, Z̄) = g(u) − 1, thus g(t) ≥ g(u).
(iii) If g(t) = h(X, X̄), g(u) = h(Z, Z̄), then g(t) = h(X, X̄) ≥
h(Z, Z̄) = g(u), thus g(t) ≥ g(u).
(iv) If g(t) = h(X, X̄), g(u) = h(Z, Z̄) + 1, then d ∈ X. By
Def. 2.1, g(d) > g(v), ∀v ∈ input(X̄), v ≠ d. If d ∉ Y
then every label in Y is less than g(d), and g(t) = h(X, X̄) =
g(d) > h(Y, Ȳ) ≥ h(Z, Z̄) = g(u) − 1, thus g(t) ≥ g(u).
If d ∈ Y, Fig. 4(c), then g(t) = h(X, X̄) = h(Y, Ȳ) =
g(d). Because d is a depth increasing node of t, and
input(Ȳ) ⊆ input(X̄), then d is also a depth increasing
node of u, but it is known that g(u) = h(Z, Z̄) + 1,
which by Lemma 2.2 indicates d ∉ Z, implying d ∈ Z̄.
Since d ∈ Z̄, then h(Z, Z̄) = g(d) − 1. Therefore,
g(t) = g(d) = h(Z, Z̄) + 1 = g(u), thus g(t) = g(u).
Figure 4: Conceptual view of (a) d ∉ Nu, g(t) = h(X, X̄) + 1,
(b) d ∉ Nu, g(t) = h(X, X̄), and (c) d ∈ Nu, g(t) = h(X, X̄)
A valid alternative K-feasible cut is (Nt − {t}, {t})
because N is K-bounded. In this situation, any node u ∈
Nt − {t} is either u ∈ input(t) or a predecessor of those
nodes, such that u ∈ Nt − input(t) − {t}. Therefore, the
maximum routing label, g(u) = p, where u ∈ Nt − {t}, and
hG(Nt − {t}, {t}) = p, resulting in g(t) ≤ p + 1. Items i-iv
prove g(t) ≥ g(u), ∀u ∈ input(t), thus p ≤ g(t) ≤ p + 1.
Lemma 2.5. If q is the maximum logic label of the nodes
in input(t), then l(t) = q or l(t) = q + 1.
Proof. If u ∈ input(t), then any cut (X, X̄) in Nt results
in either u ∈ X or u ∈ X̄.
When u ∈ X, Eqn. 3 requires that hL(X, X̄) ≥ l(u) and
by Lemma 2.3 l(t) ≥ hL(X, X̄), therefore, l(t) ≥ l(u).
When u ∈ X̄, (X, X̄) defines a cut (Y, Ȳ) in Nu, where
Y = X ∩ Nu and Ȳ = X̄ ∩ Nu. Therefore, hL(X, X̄) ≥
hL(Y, Ȳ) because Y ⊆ X, indicating that l(u) ≤ hL(Y, Ȳ) ≤
hL(X, X̄) ≤ l(t). Therefore every predecessor u ∈ Nt − {t}
satisfies l(u) ≤ l(t). This implies that l(u) ≤ l(t), ∀u ∈ input(t),
resulting in l(t) ≥ q.
A valid alternative K-feasible cut is (Nt − {t}, {t}) because
N is K-bounded. In this situation, any u ∈ Nt − {t} is
either u ∈ input(t) or a predecessor of those nodes, such that
u ∈ Nt − input(t) − {t}. Therefore, the maximum logic label,
l(u) = q, where u ∈ Nt − {t}, and hL(Nt − {t}, {t}) = q,
resulting in l(t) ≤ q + 1. Therefore, q ≤ l(t) ≤ q + 1.
Lemma 2.4 dictates minimum routing depth is achieved if
g(t) = p, either by a depth increasing node d, or by g(u) =
p − 1, ∀u ∈ Nt − LUT(t). Each v ∈ Nt for which g(v) = p
or v = t belongs to set P and is an eligible depth increasing
node. To see if any d ∈ P is depth increasing, P must be
partitioned into Pd and P̄d, as in Fig. 3(a). For any d ∈ P,
a depth first search (DFS), toward PIs rooted at d and in
P, yields Pd and P̄d = P − Pd. Fig. 3(a) shows Pd = {d, a},
which constitutes a logic chain at level p, and P̄d = {t, b},
which constitutes LUT(t). If P̄d ≠ ∅, then t ∈ P̄d, which
consists of nodes potentially included in LUT(t), and its
contents are collapsed into t to form t′. If d = t, P̄d = ∅,
indicating that LUT(t) includes all of the nodes in P (as
P = Pd), and the contents of P are collapsed into t to form t′.
Lemma 2.6. Let set P contain {v : v ∈ Nt, g(v) = p} ∪
{t}. For d ∈ P, let Pd be the DFS tree rooted at d and in P,
and P̄d = P − Pd. N′t contains a depth increasing node d if
there exists no edge(u, v), where u ∈ Pd − {d} and v ∈ P̄d.
Proof. If d = t, then P̄d = ∅ and t′ is formed by collapsing
P. Here, because t is not a predecessor of any node yet
labeled in N, it is assumed to be the depth increasing node
of its unknown descendant until proven otherwise.
When d ≠ t, t′ is created by collapsing the nodes in P̄d.
The lack of an edge connecting any node in Pd − {d} to any
in P̄d indicates that g(u) < p, ∀u ∈ input(t′), u ≠ d. Using
proof by contradiction, assume d is a valid depth increasing
node and that there exists edge(u, v), where u ∈ Pd − {d}
and v ∈ P̄d. It is known g(d) = p and d ≠ u, implying
g(u) ≥ p. Therefore, (Nt − P̄d, P̄d) defines a cut where u, d ∈
input(P̄d) and g(u) = g(d) = p. By Def. 2.1, d is not a
valid depth increasing node because ∃edge(u, v) ∈ Nt where
d ≠ u, which is a contradiction.
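The partition of P and the validity test of Lemma 2.6 can be sketched as follows. The predecessor dictionary `preds` and the function names are assumptions for illustration, and the edges used in the example are hypothetical, chosen only so that the partition reproduces the sets {d, a} and {t, b} quoted for Fig. 3(a).

```python
def partition_P(preds, P, d):
    """Split P into the part reached by a DFS from d toward the PIs that
    stays inside P (the chain side), and the remainder (the LUT side)."""
    chain_side, stack = set(), [d]
    while stack:
        v = stack.pop()
        if v in chain_side:
            continue
        chain_side.add(v)
        stack.extend(u for u in preds.get(v, ()) if u in P)
    return chain_side, set(P) - chain_side


def is_valid_depth_increasing(preds, P, d):
    """Lemma 2.6 test: no edge may run from the chain side, other than
    from d itself, into the LUT side."""
    chain_side, lut_side = partition_P(preds, P, d)
    return not any(
        u in chain_side and u != d
        for v in lut_side
        for u in preds.get(v, ())
    )


P = {"a", "d", "b", "t"}
preds = {"d": {"a"}, "t": {"d", "b"}}          # assumed edges a->d, d->t, b->t
print(partition_P(preds, P, "d"))
print(is_valid_depth_increasing(preds, P, "d"))

preds_bad = {"d": {"a"}, "b": {"a"}, "t": {"d", "b"}}  # extra edge a->b
print(is_valid_depth_increasing(preds_bad, P, "d"))
```

The extra edge a->b makes a a second label-p input of the LUT side, so d is rejected, which is the contradiction used in the proof.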
The presence of a valid d ∈ Nt can be ensured; however,
this does not guarantee that it can be identified correctly. N′t
does not guarantee that a K-feasible cut, if it exists, will not
divide Pd and result in an invalid routing cut (X, X̄) s.t.
g(u) = g(v), ∀u, v ∈ input(X̄), u ≠ v, d ∈ X. The solution
is to collapse all of the nodes of Pd into d′, as in Fig. 3(b),
thereby creating N′t with d′ as the lone predecessor node of
t′ with g(d′) = p when d ≠ t, and d′ = t′ when d = t. As
there may be more than one valid depth increasing node,
every d ∈ P must be tested both as a valid depth increasing
node and for a K-feasible cut. Using Lemma 2.5, the logic
label can be used to select the d that produces minimum hL(X, X̄).
Any Nt that does not contain a d is deemed invalid and is
eliminated from consideration. The case when d = t implies
that g(t) = p and t is regarded as the first cell in a chain. If
a valid N′t is formed, and a K-feasible cut is found in it, a
corresponding K-feasible cut can be found in Nt.
Lemma 2.7. Given a valid N′t with d′, Nt has a p − 1
height K-feasible routing cut when d ∈ X̄ and p when d ∈ X
if and only if N′t has a K-feasible routing cut.
Proof. Let T denote the set of nodes in Nt that are
collapsed into t′ and D denote the set of nodes in Nt that
are collapsed into d′.
If d′ ∈ X̄′ or d′ = t′, then X̄ = (X̄′ − {d′, t′}) ∪ D ∪ T
and X = X′. Accordingly, (X, X̄) is a K-feasible cut of
Nt because input({d′, t′}) = input(D ∪ T). Consequently,
hG(X, X̄) ≤ p − 1 because X′ = X does not contain any node
with routing label p or higher, as all such nodes are located
in (D ∪ T) ⊆ X̄. According to Lemma 2.4, g(t) ≥ p implies
that hG(X, X̄) ≥ p − 1. Since p − 1 ≤ hG(X, X̄) ≤ p − 1,
then hG(X, X̄) = p − 1.
If d′ ∈ X′, then X̄ = (X̄′ − {t′}) ∪ T and X = (X′ − {d′}) ∪
D. Accordingly, (X, X̄) is a K-feasible cut of Nt because
input(t′) = input(T). Lemma 2.6 yields hG(X, X̄) = p
because g(d) = p and d ∈ X. Furthermore, Lemma 2.6
indicates that g(u) < p, ∀u ∈ input(X̄), u ≠ d.
Using a valid N′t with d′, the flow residual graph N′′t is
constructed. The node cut-size problem is transformed to
an edge cut-size problem by splitting each node, allowing the
use of the Max-flow Min-cut algorithm. For {v : v ∈ N′′t, v ≠
s, v ≠ t′}, replace {v} with {v1, v2} connected by bridging
edge(v1, v2) with cap(v1, v2) = 1, input(v1) = input(v), and
output(v2) = output(v). Give all non-bridging edges infinite
capacity. The result is flow residual graph N′′t, to which the
Max-flow Min-cut algorithm can be applied to determine if
there is a K-feasible cut, and therefore a corresponding cut
in N′t [4]. This technique is exactly the same as that used in
Lemma 4 of FlowMap [2] and is summarized in Lemma 2.8.
Lemma 2.8. N′t has a K-feasible routing cut if and only
if N′′t has a K-feasible routing cut.
Proof. Using the Max-flow Min-cut Theorem [4], N′′t
has a cut with e(X′′, X̄′′) ≤ K if and only if the maximum
flow between s and t′ is no more than K. Each bridging edge
in flow residual graph N′′t has capacity of 1, thus the
augmenting path algorithm can be used to find maximum flow.
If K + 1 augmenting paths are found, N′′t cannot possess
a K-feasible edge cut. If K or fewer augmenting paths are
found, e(X′′, X̄′′) ≤ K, resulting in a disconnection of
N′′t before finding the (K + 1)th path. The K-feasible node
cut (X′′, X̄′′) can be identified by performing a DFS rooted
at s on the nodes in N′′t that are reachable in the residual
graph. N′′t induces a node cut (X′, X̄′) in N′t by creating
u ∈ input(X̄′) corresponding to u1 ∈ input(X̄′′).
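The node-splitting construction and the bounded augmenting-path search can be sketched as below. This is a generic illustration of the technique on an already prepared network, not the authors' implementation: it omits the collapsing of nodes into d′ and t′, uses a breadth first search where the text mentions a DFS, and the function name `k_feasible_cut` and the example network are hypothetical.

```python
from collections import defaultdict, deque

INF = float("inf")


def k_feasible_cut(edges, s, t, K):
    """Return a set of at most K cut nodes separating s from t, or None."""
    cap = defaultdict(int)   # residual capacities keyed by (tail, head)
    adj = defaultdict(set)   # residual adjacency in both directions

    def add_edge(u, v, c):
        cap[(u, v)] += c
        adj[u].add(v)
        adj[v].add(u)

    def v_in(v):
        return v if v in (s, t) else (v, "in")

    def v_out(v):
        return v if v in (s, t) else (v, "out")

    inner = {x for e in edges for x in e} - {s, t}
    for v in inner:
        add_edge(v_in(v), v_out(v), 1)      # bridging edge, capacity 1
    for u, v in edges:
        add_edge(v_out(u), v_in(v), INF)    # original edge, infinite capacity

    def reachable():
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in parent and cap[(u, w)] > 0:
                    parent[w] = u
                    queue.append(w)
        return parent

    flow = 0
    while True:
        parent = reachable()
        if t not in parent:
            break
        flow += 1
        if flow > K:                        # a (K + 1)th augmenting path exists
            return None
        w = t
        while parent[w] is not None:        # push one unit along the path
            u = parent[w]
            cap[(u, w)] -= 1
            cap[(w, u)] += 1
            w = u
    # Cut nodes: entry half reachable from s, exit half not reachable.
    return {v for v in inner if v_in(v) in parent and v_out(v) not in parent}


edges = [("s", "a"), ("s", "b"), ("s", "c"), ("a", "x"), ("b", "x"),
         ("c", "y"), ("x", "t"), ("y", "t")]
print(k_feasible_cut(edges, "s", "t", 2))
print(k_feasible_cut(edges, "s", "t", 1))
```

Stopping as soon as the flow exceeds K keeps the search at no more than K + 1 augmentations per candidate, which is where the O(K · m) term in the complexity discussion comes from.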
The ability of the depth increasing node to be any {d : d ∈
Nt, g(d) = p} creates multiple valid LUT(t) sets, each with
equal routing depth but potentially different logic depth.
For each Nt with a K-feasible node cut as found in N′′t,
the optimal overall depth cut can be found by choosing the
minimum hL(Xt, X̄t) according to Eqn. 3.
Lemma 2.9. If hL(Xt, X̄t) > hL(X, X̄), the minimum
routing and logic depth solution of Nt is (Xt, X̄t) = (X, X̄).
Let m be the number of edges in Nt. Given the preceding
discussion, a minimal depth solution uses an O(n) search for
d, an O(m + n) DFS search for its predecessors, and O(K · m)
to identify the minimum depth routing cut for each d.
Theorem 2.10. A minimum height routing cut with minimum
logic depth in Nt can be found in O(n² + K · m · n).
Applying Theorem 2.10 in topological order yields a labeling
of Nt such that the routing depth of t is minimum and,
within its confines, the logic depth is also minimum. This
yields a complete labeling solution for each node in N.
Corollary 2.11. A minimum depth solution of N can
be found in O(n³ + K · m · n²).
2.2 ChainMap Mapping
The mapping phase of the ChainMap algorithm is identical
to that of FlowMap and its proof is reproduced here
for the sake of completeness. It consists of creating a set
T that initially contains all the POs. For each t ∈ T, a
minimum height cut (Xt, X̄t) was computed during labeling.
Using this cut, t′ is created from the nodes in X̄t and
is the K-LUT implementing all nodes in X̄t. T is updated
as (T − {t}) ∪ input(t′), and the process is repeated until all
of the nodes in T are PIs. It remains valid for ChainMap as
long as node labeling is performed as prescribed in Sec. 2.1.
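The mapping loop described above can be sketched as a worklist over the POs. The sketch assumes labeling has already stored, for every node t, the set of nodes placed in its LUT (`lut_of[t]`, the sink side of its cut); that dictionary, the predecessor map, and the function name are hypothetical.

```python
def map_luts(primary_outputs, primary_inputs, lut_of, preds):
    """Return {t: inputs of the K-LUT rooted at t} for every generated LUT."""
    pending = list(primary_outputs)
    mapped = {}
    while pending:
        t = pending.pop()
        if t in primary_inputs or t in mapped:
            continue
        cone = lut_of[t]
        # input(t'): nodes outside the collapsed cone that drive it.
        lut_inputs = {u for v in cone for u in preds.get(v, ()) if u not in cone}
        mapped[t] = lut_inputs
        pending.extend(lut_inputs)  # T = (T - {t}) ∪ input(t')
    return mapped


preds = {"c": {"a", "b"}, "e": {"c", "d"}}
pis = {"a", "b", "d"}

# Cut that absorbs c into the LUT of e: one LUT with three inputs.
print(map_luts(["e"], pis, {"e": {"c", "e"}}, preds))
# Cut that keeps c separate: two LUTs.
print(map_luts(["e"], pis, {"e": {"e"}, "c": {"c"}}, preds))
```

Each node is expanded at most once, so the loop visits every node and edge a constant number of times, consistent with the O(m + n) bound stated for this phase.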
Theorem 2.12. For any K-bounded Boolean network N ,
ChainMap produces a K-LUT mapping solution with mini-
mum depth in O(m+ n) time.
Proof. By induction, for any node t ∈ N, if a K-LUT t′
is generated for t during the mapping phase, then the level
of t′ in the mapping solution is no more than g(t) and l(t),
the depth of the optimal mapping solution for Nt. Since any
mapping solution for N induces a solution for Nt, g(t) and
l(t) are also the minimum depths for the K-LUT generated
for t in any mapping solution of N. Therefore, the mapping
solution of N is optimal and requires O(n + m) time [2].
Corollary 2.13. Labeling requires O(n³ + K · m · n²),
and mapping requires O(n + m). Hence, the first two stages
of ChainMap are polynomial in O(n³ + K · m · n²) + O(n +
m) = O(n³ + K · m · n²). In practice, m = O(K · n) and
K ∈ {4, 5, 6}, making their runtime O(n³).
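The final simplification follows by substituting the practical edge bound into the labeling term and treating K as a small constant:

```latex
\begin{align*}
O(n^3 + K \cdot m \cdot n^2)
  &= O\bigl(n^3 + K \cdot (K \cdot n) \cdot n^2\bigr) && \text{since } m = O(K \cdot n) \\
  &= O\bigl((1 + K^2)\, n^3\bigr) \\
  &= O(n^3) && \text{since } K \le 6 \text{ is a constant.}
\end{align*}
```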
A logic chain is defined as a series of depth increasing
nodes, such that the logic depth of each consecutive chain
node increases, while the routing depth remains constant.
Definition 2.14. A logic chain is a subnetwork L ⊆ N
such that g(uj) = g(ui), l(uj) = l(ui) + 1 for each pair of
consecutive nodes ui, uj ∈ L.
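Read over consecutive chain nodes, the definition reduces to a pairwise check on the two labels. A minimal sketch, assuming the chain is given as an ordered list and the labels as dictionaries (all names hypothetical):

```python
def is_logic_chain(chain, g, l):
    """True when routing depth stays constant and logic depth rises by
    exactly one between each pair of consecutive chain nodes."""
    return all(
        g[v] == g[u] and l[v] == l[u] + 1
        for u, v in zip(chain, chain[1:])
    )


chain = ["u1", "u2", "u3", "u4"]
g = {"u1": 1, "u2": 1, "u3": 1, "u4": 1}
l = {"u1": 1, "u2": 2, "u3": 3, "u4": 4}
print(is_logic_chain(chain, g, l))
```

The labels in the example follow the four-cell chain described for Fig. 1(a), where every LE has g = 1 and logic depth runs from 1 to 4.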
2.3 ChainMap Duplication
The exclusivity constraint of chains is defined as the re-
quirement that a chain net be a single-source, single-sink
relationship between adjacent LEs. When the network is
viewed as a set of LUTs, as in the SIS internal representation, it
means that a node t can have at most two chain outputs u
and v. However, there are constraints on which LUTs can be
part of the same LE, assuming that an architecture allows a
full K-LUT function on the chain. Note that a discussion of
N now assumes that the mapping phase has been applied,
thus references to t indicate the actual K-LUT formed by
collapsing the nodes in LUT (t) to t.
Lemma 2.15. For each t ∈ N , if {u, v : u, v ∈ output(t), v ≠
u, g(t) = g(u) = g(v)} satisfy the following constraints,
{u, v} can populate the same LE. If any u cannot be paired
with any v, u is implemented in an LE by itself.
(i) If input(u) = input(v) and |input(u)| = |input(v)| =
K, then u and v must compute the same function.
(ii) If |input(u)∪input(v)| < K, then u and v can compute
separate functions.
(iii) For a pair u, v ∈ output(t), g(w) > g(u), ∀w ∈ output(u)
and g(x) = g(v),∀x ∈ output(v).
(iv) u ∉ input(v) and v ∉ input(u).
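One possible reading of the four constraints as a pairing predicate is sketched below. The `Lut` class, its field names, and `can_share_le` are assumptions made for illustration. In particular, rejecting pairs that satisfy neither (i) nor (ii) is an interpretation of the lemma, not something it states outright.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Lut:
    name: str
    g: int                  # routing depth label g(.)
    func: str               # hypothetical identifier of the Boolean function
    inputs: frozenset       # names of fanin nodes
    outputs: list = field(default_factory=list)   # fanout Lut objects


def can_share_le(u, v, K):
    """u takes the sum (routing) role, v takes the cout (chain) role."""
    # (iv) neither node may feed the other
    if u.name in v.inputs or v.name in u.inputs:
        return False
    # (iii) u drives only routing nets, v drives only chain nets
    if not all(w.g > u.g for w in u.outputs):
        return False
    if not all(x.g == v.g for x in v.outputs):
        return False
    # (ii) fewer than K distinct inputs: two separate functions fit
    if len(u.inputs | v.inputs) < K:
        return True
    # (i) identical full-width input sets: only a shared function fits
    if u.inputs == v.inputs and len(u.inputs) == K:
        return u.func == v.func
    return False            # assumed: anything else cannot be packed


# Hypothetical fanouts of a node t, with K = 4
sink_r = Lut("r", 2, "f0", frozenset({"u"}))      # reached over routing
sink_c = Lut("c", 1, "f1", frozenset({"v"}))      # reached over a chain net
u = Lut("u", 1, "fu", frozenset({"t", "a"}), [sink_r])
v = Lut("v", 1, "fv", frozenset({"t", "b"}), [sink_c])
```

Because the roles are asymmetric, a caller would try both `can_share_le(u, v, K)` and `can_share_le(v, u, K)` before giving up on a pair.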
Algorithm 1 The ChainMap Algorithm
1: procedure ChainMap(N)
2:   for v ∈ N do                              ▷ Phase 1: Labeling
3:     l(v) = g(v) = 0
4:   end for
5:   T = N − PI(N) in topological order
6:   while |T| > 0 do
7:     T = T − {t}; Nt = DFS(N, t); add global source s
8:     let p = max{g(u) : u ∈ input(t)};
9:     let q = max{l(u) : u ∈ input(t)}
10:    Xt = ∅;
11:    let P = {u : u ∈ Nt, g(u) = p} in topological order
12:    for {d : d ∈ P} do
13:      let Pd = DFS(P, d); P̄d = P − Pd
14:      if ∃edge(u, v), ∀u ∈ Pd − {d}, v ∈ P̄d then
15:        Nt is invalid for d, skip rest of for loop
16:      end if
17:      form d′ by collapsing u ∈ Pd into d
18:      if P̄d = ∅ then t′ = d′
19:      else
20:        form t′ by collapsing u ∈ P̄d into t
21:      end if
22:      create N′t with t′ and d′
23:      split {v : v ∈ N′t : v ≠ s, v ≠ t} into {v1, v2}
24:      assign cap(v1, v2) = 1 to bridge edges, ∞ to all others
25:      MaxFlowMinCut(N″t)
26:      if {∃(X″, X̄″) : e(X″, X̄″) ≤ K} then
27:        induce (X′, X̄′) in N′t from (X″, X̄″) in N″t
28:        induce (X, X̄) in Nt from (X′, X̄′) in N′t
29:        if hL(X, X̄) < hL(Xt, X̄t) then




34:    if Xt ≠ ∅ then
35:      g(t) = p; l(t) = hL(Xt, X̄t) + 1
36:    else
37:      g(t) = p + 1; l(t) = hL(Xt, X̄t) + 1
38:    end if
39:  end while
40:  T = PO(N)                                 ▷ Phase 2: Mapping
41:  while {t : t ∈ T, t ∉ PI(N)} do
42:    form LUT t′ by collapsing v ∈ Xt into t
43:    T = (T − {t}) ∪ input(t′)
44:  end while
45:  RelaxShortestLogic(N)                     ▷ Optional Relaxation
46:  T = N − PI(N) in reverse topological order  ▷ Phase 3: Duplication
47:  while T ≠ ∅ do
48:    T = T − {t};
49:    L = {u, v : u, v ∈ output(t), g(t) = g(u) = g(v)}
50:    for u, v ∈ L do
51:      if {u, v} is a valid LE and L − {u, v} ≠ ∅ then
52:        Create t′ as a duplicate of node t
53:        output(t) = output(t) − {u, v}; output(t′) = {u, v}
54:        L = L − {u, v}
55:      end if
56:    end for
57:    while |L| > 1 do
58:      L = L − {u}




Regarding Lemma 2.15(i), a pair {u, v} meeting
|input(u) ∪ input(v)| ≤ K does not necessarily have the
computation resources it needs in a single LE. If either
|input(u)| = K or |input(v)| = K, then {u, v} cannot reside
in the same LE because the LE can compute only one
K-input function, as in Fig. 2(b). However, if both
|input(u)| < K and |input(v)| < K, the LE has enough LUT
resources to accommodate both sub-width functions, as
reflected in Lemma 2.15(ii) and in Fig. 2(a). Exclusivity
also requires that the outputs of u and v are heterogeneous.
That is, u must only source a routing net, while v must only
source a chain net, as in Lemma 2.15(iii). This constraint
indicates that an LE has only one available cout port and
one sum port. It should be noted that the terms cout and
sum refer only to the type of net a node drives, chain or
routing, respectively. They do not indicate the Boolean
function computed by either node; the nomenclature is
merely borrowed from carry-select addition. If nodes u and
v are to be contained in the same LE, one must exclusively
use the cout port, and one must exclusively use the sum
port. Finally, Lemma 2.15(iv) indicates that u and v cannot
be dependent on each other because there is no internal LE
connection between the sum and cout LUTs.
Figure 5: Chain tree (a) before, (b) worst case duplication,
(c) average case with relaxation.
If a node has more than one chain net output, it must
be duplicated if its descendants cannot meet the aforemen-
tioned constraints. Fig. 5(a) shows a logic chain tree formed
by ChainMap. In it, all routing nets are omitted, and all
nodes are in logic chains. Original internal nodes are white,
leaf nodes are black, and duplicate nodes are gray. Us-
ing output(b) = {t1, t3, c} as an example, assume no LEs
can be formed of any pair. This precipitates two duplica-
tions of b, which causes output(a) = {b, t4} to change to
output(a) = {t4, b, b, b}. Assuming no LEs can be formed
of any pair in {t4, b, b, b}, a is duplicated three times, which
causes s to be duplicated at least three times. This pattern
continues for all nodes in Fig. 5(a), resulting in Fig. 5(b).
Lemma 2.16. The number of node duplications required
in N to satisfy exclusivity is O(n^2).
Proof. Let Ns be a subgraph consisting of edges and
nodes discovered in a depth first search rooted at s ∈ N , such
that for u ∈ Ns, v ∈ output(s), v is visited only if g(u) =
g(v). By Def. 2.1, there can only exist one edge(u, v) ∈
Ns, ∀u, v ∈ Ns. Therefore, Ns is a logic chain tree with leaf
nodes denoted ti, 1 ≤ i ≤ |V (Ns)|, as in Fig. 5(a). Addi-
tionally, there exists a logic chain Lj , 1 ≤ j ≤ |V (Ns)| from
s to ti, pursuant to Def. 2.14.
The worst case area expansion occurs when u is duplicated
∀edge(u, v), ∀u, v ∈ Ns, v ∈ output(u). This implies the
duplication network N′s consists of each path from s to ti
duplicated in its entirety. Fig. 5(b) demonstrates that N′s
consists of a logic chain for each ti. Because
1 ≤ i, j ≤ |V(Ns)|, |V(N′s)| = O(|V(Ns)|) · O(|V(Ns)|).
Therefore, for N with n nodes, the number of duplications
is O(n^2).
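The quadratic bound can be seen on a small example. The sketch below counts the nodes of the fully duplicated network, in which every root-to-leaf path of the chain tree becomes its own chain. The "caterpillar" tree shape and all names are hypothetical and chosen only because they make the growth easy to follow.

```python
def expanded_size(children, root):
    """Node count after every root-to-leaf path is copied in full."""
    total = 0
    stack = [(root, 1)]               # (node, nodes on the path so far)
    while stack:
        node, path_len = stack.pop()
        kids = children.get(node, [])
        if not kids:                  # leaf: one private chain of path_len
            total += path_len
        for kid in kids:
            stack.append((kid, path_len + 1))
    return total


def caterpillar(k):
    """Hypothetical chain tree: spine s0..s(k-1), one leaf per spine node."""
    children = {}
    for i in range(k):
        kids = [f"t{i}"]
        if i + 1 < k:
            kids.append(f"s{i + 1}")
        children[f"s{i}"] = kids
    return children


small = expanded_size(caterpillar(4), "s0")     # tree has 8 nodes
large = expanded_size(caterpillar(10), "s0")    # tree has 20 nodes
```

For this shape the original tree has 2k nodes while the duplicated network has k(k + 3)/2, so the expansion grows quadratically with the tree size, in line with the lemma.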
Theorem 2.17. For any K-bounded Boolean network N ,
an O(n^2) expansion is performed for n nodes in N , and
ChainMap produces a depth optimal solution valid within
the exclusivity constraint in O(n^3) time.
Corollary 2.18. The labeling phase of ChainMap requires
O(n^3 + K · m · n^2), the mapping phase requires O(n + m),
and duplication requires O(n^3). This makes the entire
ChainMap algorithm polynomial in O(n^3 + K · m · n^2) +
O(n + m) + O(n^3) = O(n^3 + K · m · n^2). In practice,
m = O(K · n) and K = {4, 5, 6}, making the complete
runtime O(n^3).
The ChainMap algorithm is presented in Algorithm 1
and includes all three stages. ChainMap maintains a
polynomial O(n^3) runtime with mapped solution area
bounded by O(n^2) of the original network. Area is a major
concern because ChainMap assumes its routing delay is
equivalent to that encountered in a traditional mapping
solution. If the worst case is encountered, the increased
wire length usurps any performance gains. Duplication is
combated by relaxing chain nets to allow more nodes to
comply with Lemma 2.15.
2.4 ChainMap Relaxation
The classic trade-off between area and speed is extremely
evident in ChainMap solutions. Results indicate full dupli-
cation yields highly prohibitive area increases. For exam-
ple, the number of 5-LUTs in traditional mapping versus a
ChainMap solution increases from 4,752 to 9,835 for cfft
(K = 5, before, 2.07x). Relaxation of routing depth can
be used as a means for reducing area. In return for adding
a level of routing to some paths, a chain net and its dupli-
cation are eliminated. Because ChainMap makes all paths
of roughly uniform routing depth, the delay of the network
is dependent on the variance in logic depth. The goal is to
relax paths with minimum logic depth and mask the addi-
tional routing delay with paths of high logic depth.
Fig. 5(a) shows a DFS chain tree rooted at node s. As-
suming Lemma 2.15 is fulfilled, output(s) = {a, t5, d, g} can
form an LE of {a, t5}. Consequently, assuming {d, g} ful-
fill (i), (ii), and (iv), they still cannot form an LE because
they violate (iii). Duplications occur en masse under this
circumstance, along the longest network paths. Instead, if
edge(s, d) and edge(s, g) are relaxed from chain to routing
nets, the tree is disconnected at d and g, and at least 2 du-
plications of s are saved. Fig. 5(c) assumes that all nodes
satisfy Lemma 2.15, except for nodes {d, g}, which violate
item (iii), and t1 because {c, t3} form a valid LE. All are
relaxed because they are not along the longest logic branch
of their respective sub-trees. Fig. 5(b) shows the worst case
for area, while Fig. 5(c) shows the average case ChainMap
solution, with LE pairs circled in dotted lines.
For all s ∈ N and u, v ∈ output(s), the longest DFS
chain tree branch v and its valid LE mate u are preserved,
while output(s) − {u, v} are relaxed. Longer logic chains
are preserved, ultimately masking the delay of the relaxed
edge(s, v). This heuristic method specifically targets arith-
metic designs typically containing chain tree nodes with long
and short logic branches.
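The heuristic can be sketched as follows. This is an illustrative reading rather than the authors' RelaxShortestLogic routine: `valid_le` stands in for the Lemma 2.15 test, and the tree below only partly follows Fig. 5(a) (the children of c, d, and g are invented so that the example runs).

```python
def branch_length(children, node):
    """Number of nodes on the longest chain branch starting at node."""
    kids = children.get(node, [])
    return 1 + max((branch_length(children, k) for k in kids), default=0)


def relax_chain_tree(children, root, valid_le):
    """Keep the longest branch v and one valid mate u; relax the rest."""
    kept, relaxed = {}, []
    stack = [root]
    while stack:
        s = stack.pop()
        kids = sorted(children.get(s, []),
                      key=lambda k: branch_length(children, k),
                      reverse=True)
        if not kids:
            continue
        v = kids[0]                                   # longest branch
        u = next((k for k in kids[1:] if valid_le(k, v)), None)
        kept[s] = [v] if u is None else [v, u]
        # every other chain net leaving s becomes a routing net
        relaxed.extend((s, k) for k in kids if k not in kept[s])
        stack.extend(kids)                            # relaxed kids root new trees
    return kept, relaxed


children = {"s": ["a", "t5", "d", "g"], "a": ["b", "t4"],
            "b": ["t1", "t3", "c"], "c": ["t2"], "d": ["t6"], "g": ["t7"]}
valid_pairs = {frozenset(p) for p in [("a", "t5"), ("b", "t4"), ("c", "t3")]}
kept, relaxed = relax_chain_tree(
    children, "s", lambda u, v: frozenset((u, v)) in valid_pairs)
```

With these assumed pairings the sketch keeps {a, t5} under s and {c, t3} under b, and relaxes edge(s, d), edge(s, g), and edge(b, t1), which matches the outcome described for Fig. 5(c).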
3. EXPERIMENTAL RESULTS
To accurately assess the effectiveness of the ChainMap
algorithm, it is necessary to test designs with HDL defined
arithmetic carry chains. For this purpose OpenCores [7]
DSP, security, and controller benchmarks have been selected
with a range of arithmetic penetrance. Fig. 6 depicts the
design flows, each inserting arithmetic at different points:
• Forget - Arithmetic chains are optimized by synthesis
and mapped with ChainMap without HDL.
• Before - Arithmetic chains are preserved through
synthesis, and reinserted before ChainMap.
• After - Arithmetic chains are preserved through
synthesis and ChainMap, and reinserted before PNR.
• Normal - Arithmetic chains are preserved through
synthesis and FlowMap, and reinserted before PNR.
Figure 6: Experimental Design Flows
Quartus II has an open netlist format, VQM, and an open
design flow where academic tools can be tested [6]. Because
SIS lacks HDL elaboration, a parser has been created to
implement a VQM netlist in SIS internal representation. An
option has been included to preserve arithmetic carry chains
or implement them as bit-sliced cout and sum operations.
The drawback to using Quartus II for HDL interpretation
is that optimization and K-LUT mapping on the netlist has
been performed before importing to SIS. To mitigate this,
the logic network is decomposed into 2-input AND and OR
gates and resynthesized with SIS using script.algebraic. The
speedup and area (i.e. number of LUTs) results produced
by the three ChainMap flows are normalized to the normal
flow. Speedup values greater than 1 represent a decrease in
delay. An LUT ratio of less than 1 indicates area savings.
Fig. 7 shows relaxed speedup averaged across all bench-
marks under all three flows for K = {4, 5, 6}. The indepen-
dent axis is the ratio of average routing delay to LUT delay
(G : L). Since routing delay is variable, Fig. 7 shows how
speedup is affected by changes in average routing delay rel-
ative to static LUT delay. Changing G : L shows how the
effectiveness of the heuristic relaxation technique changes as
average routing delay increases. Common G : L lies within
the range of [2, 4], which for Stratix is akin to an LUT delay
of 366 ps and routing delay of [732 ps, 1464 ps].
Tables 1, 2, and 3 show results for all benchmarks. They
present the optimal and relaxed routing depth (Go, G) and
logic depth (Lo, L) of the paths with maximum routing
depth and maximum logic depth, the speedup when
G : L = 3, the relaxed number of LUTs used, and the ratio
of ChainMap LUTs to normal (λ). They indicate that in all
cases the optimal ChainMap solution is at least as fast as
HDL dictated chains. The relaxed solutions, however, show
a mixed record of taking advantage of this potential
speedup, but do consistently reduce the overall LUT
utilization of a benchmark.
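As a reading aid for the λ column, the ratio can be recomputed from the LUT counts. The snippet below uses three rows of Table 2 (K = 5, before flow versus normal); the dictionary names are illustrative only.

```python
# LUT counts copied from Table 2 (K = 5)
normal_luts = {"cfft": 4749, "mlt3x3": 901, "jpeg": 5875}
before_luts = {"cfft": 3357, "mlt3x3": 754, "jpeg": 4916}

# lambda = ChainMap LUTs / normal LUTs; values below 1 indicate area savings
lam = {design: round(before_luts[design] / normal_luts[design], 2)
       for design in normal_luts}
```

The results agree with the λ entries printed in the table for these designs.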
Benchmark results indicate optimal ChainMap performance
varies with flow and LUT size, but is equal to or better
than normal, as expected. Varying the value of K produces
results that mirror the expected result of incorporating more
logic into each LUT; as LUT size increases, speedup
increases and area decreases. Across all LUT sizes, the
before and forget flows closely mirror each other, with an
average difference of approximately 5%. This is a very
important result, as it means that arithmetic chains can be
discovered and mapped without relying on HDL macros.
Although ignoring HDL macros and using ChainMap with
relaxation produces solutions typically between 0.95x and
1.4x the speed of the normal case, the optimal results
indicate that there are still potential performance increases
to be realized.
Figure 7: Speedup of ChainMap flows relative to
normal flow vs. average routing to LUT delay ratio.
An interesting phenomenon occurs in Fig. 7 where, as G : L
increases, the effectiveness of relaxed before and forget
increases for K = 4, holds steady for K = 5, and decreases for
K = 6, while after increases for all K. This is due to the
disparate effect that the relaxation technique has on before
and forget versus after, and the decrease in nets as K increases.
Because relaxation is applied to shorter logic paths to mask
its effect with longer logic paths, as G : L increases, the abil-
ity to mask relaxed nets decreases for larger K. This does
not affect the after flow, because very few chain nets can
be identified by ChainMap when HDL macros are excluded
during mapping, thus relaxation is rarely applied. K = 4
maintains relatively deep logic depth due to many LUTs and
nets, while logic depth is reduced for K = {5, 6}, revealing
their lack of ability to mask relaxations as effectively.
The most heavily arithmetic design, the radix-4 FFT,
yields a relaxed solution with a speedup of 1.00x over
normal, and an optimal solution of 1.11x (cfft, K = 5,
before). This indicates that ChainMap, coupled with the
relaxation procedure in Sec. 2.4, produces chains at least as
good as HDL macros, but that there may exist other
relaxation techniques that are less aggressive in reducing
LUTs. The LUT results reflect this, with the ChainMap
solution 0.71x that of normal, indicating optimal
performance can potentially be recouped through different
relaxation techniques, or by relying on the smaller design to
yield shorter wires during PNR.
The phenomenon of area reduction applies to nearly all
designs tested and can potentially increase speedup values
universally. It stems from two sources, the first being that
the chain cut is naturally more area aggressive. If a node
fails to join a logic level q (d = t) because of a cut size
greater than K, ChainMap searches out an alternate
K-feasible cut (d ≠ t). This cut is an alternative to
implementing the node on a new logic level and thus each
chain cut tends to incorporate more nodes. The second, and
more prevalent, reason is
that preserved arithmetic chains are typically 3-input gates
that are not merged with others and are ultimately imple-
mented as lone, underpopulated LUTs. ChainMap allows
these underpopulated LUTs to be packed together.
4. CONTRIBUTIONS AND IMPLICATIONS
ChainMap provides a polynomial time solution to the
problem of identifying generic logic chains in a Boolean net-
work. By looking at the problem of circuit depth from the
perspective of minimizing routing depth, it has been shown
that considerable performance gains can be realized. The
important contributions of ChainMap are as follows:
1. A formal logic chain definition is presented that encom-
passes both arithmetic and non-arithmetic operations.
2. ChainMap creates generic logic chains in polynomial
time without HDL arithmetic chain macros.
3. An area trade-off is necessary due to the exclusivity
constraint of current FPGA carry chain architectures,
but can be eliminated with relaxation.
4. ChainMap ensures logic chains can be created without
HDL, affording researchers an opportunity to rethink
CAD algorithm and FPGA architecture design.
5. FPGAs with carry-select inspired elements can take
advantage of ChainMap solutions.
The definition of a logic chain has been formalized as a
series of nodes, such that there is a directed edge(u, v) be-
tween adjacent nodes {u, v}, that causes the logic depth of
v to increase while not increasing its routing depth. This
definition addresses the fact that there is a clear difference
in the speed of routing versus chain nets, and guides their
use. The average speedup of ChainMap versus a traditional
mapping algorithm with HDL chains is 1.4x optimally and
1.25x relaxed, for K = {4, 5, 6} and reasonable average rout-
ing delays. While all K provide performance gains, when
K = {5, 6}, underpopulated HDL macro LUTs can more
often be packed together, yielding slightly higher average
speedup and LUT savings. This result concurs with results
for general networks, where K = {5, 6} yield the best depth
for LUT-based FPGAs [9]. An assessment of the impact
ChainMap has on place and route is left as future work.
ChainMap requires an area/speedup trade off, an artifact
of FPGAs enforcing the exclusivity constraint. However, the
simple relaxation heuristic presented allows ChainMap to
produce consistent area reductions. Area reductions of up
to 0.71x are witnessed (cfft, K = 5, before) with neutral
speedup, and the potential to increase speed through shorter
wires. Optimal solutions, while prohibitive from an area
standpoint, indicate that better relaxation techniques have
the potential to yield ubiquitous speedup increases. The
results presented in this work are indicative of LEs which
can operate in both (K-1) and K-LUT modes, as depicted in
Fig. 2, and supported by the Stratix and [5]. The ChainMap
algorithm can be adapted to support pure carry-select, (K-
1)-LUT chains, by searching for a (K-1)-feasible cut when
d ≠ t, and a K-feasible cut when d = t.
The average performance difference between disregarding
HDL macros completely and inserting chains before map-
ping is within 5%, indicating HDL preservation might poten-
tially be abandoned. This could affect the entire FPGA de-
sign flow, allowing CAD designers to expand algorithms past
the partitions created by HDL. Since the best area/speedup
is usually achieved by the insertion of arithmetic chains be-
fore mapping, the inference is that they are already highly
optimized in terms of literal count, and resynthesis creates
sub-optimality. ChainMap demonstrates that generic logic
chains perform better than solely arithmetic ones, a result
that could lead to innovative FPGA architectures.
Future work includes complete design flow experiments
including place and route, an assessment of power consump-
tion and routing congestion, better relaxation techniques to
unlock the potential indicated by optimal solutions, applica-
tion of ChainMap to non carry-select architectures, and an
exploration of the changes that ChainMap can have on CAD
tools and FPGA architectures. The characteristics of net-
works that perform best under ChainMap will be defined to
guide the improvement of logic chain synthesis techniques.
By rethinking technology mapping as an exercise in the
minimization of routing depth rather than logic depth, Chain-
Map is able to achieve significant performance gains for all
designs. Arithmetic HDL macros can be discarded in fa-
vor of allowing the CAD flow to decide when and where
logic chains should be created in a Boolean network. With
this approach, both FPGA hardware and software can move
beyond the arithmetic constraint, and start considering all
chains as having been created equal.
5. REFERENCES
[1] Altera. Stratix Series User Guides. www.altera.com.
[2] J. Cong and Y. Ding. FlowMap: an optimal technology
mapping algorithm for delay optimization in
lookup-table based FPGA designs. IEEE Transactions
on Computer-Aided Design of Integrated Circuits and
Systems, 13(1):1–12, 1994.
[3] A. Farrahi and M. Sarrafzadeh. Complexity of the
lookup-table minimization problem for FPGA technology
mapping. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems,
13(11):1319–1332, 1994.
[4] L. R. Ford and D. R. Fulkerson. Flows in Networks.
Princeton Univ. Press, Princeton, NJ, 1962.
[5] M. Frederick and A. Somani. Non-arithmetic carry
chains for reconfigurable fabrics. In Proceedings of the
15th International Conference on Computer Design,
pages 137–143, October 2007.
[6] S. Malhotra, T. Borer, D. Singh, and S. Brown. The
Quartus University Interface Program: enabling
advanced FPGA research. In Proceedings of the 2004
IEEE Int’l Conference on Field-Programmable
Technology, pages 225–230, Dec. 2004.
[7] OpenCores. www.opencores.org.
[8] E. Sentovich, K. Singh, L. Lavagno, C. Moon,
R. Murgai, A. Saldanha, H. Savoj, P. Stephan, R. K.
Brayton, and A. L. Sangiovanni-Vincentelli. SIS: A
system for sequential circuit synthesis. Technical
Report UCB/ERL M92/41, EECS Department,
University of California, Berkeley, 1992.
[9] S. Singh, J. Rose, P. Chow, and D. Lewis. The effect of
logic block architecture on FPGA performance. Journal
of Solid-State Circuits, 27:281–287, March 1992.
Table 1: Performance Summary for OpenCores Benchmarks, K=4
Design  | Normal          | Forget                                   | Before                                   | After
        |    G    L   LUT |   Go   Lo  SUo    G    L   SU   LUT    λ |   Go   Lo  SUo    G    L   SU   LUT    λ |   Go   Lo  SUo    G    L   SU   LUT    λ
cfft    |    5    4  4639 |    3   38 1.06    5   37 0.96  3958 0.85 |    3   38 1.06    5   37 0.96  3764 0.81 |    4    6 1.00    4    6 1.00  4640 1.00
mlt3x3  |    2   34   901 |    2   33 1.03    3   32 0.98   997 1.11 |    2   33 1.03    3   16 1.03   850 0.94 |    2   34 1.00    2   34 1.00   901 1.00
reedsol |    9    8  1411 |    7    9 1.17   10    9 0.90  1401 0.99 |    7    9 1.17   10    9 0.90  1401 0.99 |    7    9 1.17   10    9 0.90  1407 1.00
jpeg    |    6   15  5890 |    4    8 1.32    7    8 1.14  6546 1.11 |    4    7 1.32    6    6 1.18  5493 0.93 |    5   16 1.06    6   16 0.97  5916 1.00
dct     |    4    8  4767 |    2   19 1.12    3   19 1.00  5597 1.17 |    2   19 1.12    3   19 1.00  4572 0.96 |    3    8 1.00    4    8 1.00  4767 1.00
eth     |    7    6   301 |    4    6 2.00    6    5 1.50   307 1.02 |    4    6 2.00    6    5 1.50   302 1.00 |    5    9 1.13    6    5 1.03   325 1.08
usb     |    8    7  3587 |    5    8 1.35    6    8 1.19  3609 1.01 |    5    8 1.35    6    8 1.19  3569 0.99 |    5    8 1.35    6    8 1.19  3738 1.04
xtea    |    6   36   982 |    4   36 1.13    7   34 0.98  1163 1.18 |    4   36 1.13    5   36 1.06  1034 1.05 |    5   31 1.10    5   31 1.10   990 1.01
des3    |    7    6   946 |    6    7 1.08    6    7 1.08  1056 1.12 |    6    7 1.08    6    7 1.08  1056 1.12 |    6    7 1.08    6    7 1.08  1064 1.12
rsa     |    7   39  1227 |    4   36 1.25    7   35 1.07  1234 1.01 |    4   36 1.25    7   35 1.07  1002 0.82 |    6   38 1.07    6   39 1.05  1194 0.97
md5     |   18   78  2600 |   13   76 1.15   24   74 0.90  3033 1.17 |   14   75 1.13   22   71 0.96  2872 1.10 |   15   38 1.08   18   68 1.03  2838 1.09
sha512  |    8   70  5908 |    7   72 1.01   12   70 0.89  6702 1.13 |    7   69 1.04   11   68 0.93  5855 0.99 |    8   70 1.00    8   70 1.00  5780 0.98
twofish |   55   54  2748 |   20   64 1.77   26   55 1.65  3696 1.34 |   20   64 1.77   26   55 1.65  3696 1.34 |   20   64 1.77   26   55 1.65  3696 1.34
ava     |   30   34 13670 |    8   26 2.48   19   29 1.44 14543 1.06 |    8   26 2.48   19   29 1.44 14894 1.09 |    8   26 2.48   19   34 1.36 14772 1.08
aes128  |   15   14 13286 |    9   15 1.37   12   16 1.13 15311 1.15 |    9   15 1.37   12   16 1.13 15311 1.15 |    9   15 1.37   12   16 1.13 15311 1.15
Total   |  187  413 62863 |   98  453    –  153  438    – 69153    – |   99  448    –  147  417    – 65671    – |  108  379    –  138  406    – 67339    –
Ratio   | 1.00 1.00  1.00 | 0.52 1.10 1.36 0.82 1.06 1.14  1.10 1.10 | 0.53 1.08 1.36 0.79 1.01 1.17  1.04 1.04 | 0.58 0.92 1.28 0.74 0.98 1.16  1.07 1.07
Table 2: Performance Summary for OpenCores Benchmarks, K=5
Design  | Normal          | Forget                                   | Before                                   | After
        |    G    L   LUT |   Go   Lo  SUo    G    L   SU   LUT    λ |   Go   Lo  SUo    G    L   SU   LUT    λ |   Go   Lo  SUo    G    L   SU   LUT    λ
cfft    |    4    6  4749 |    3   36 1.11    5   34 1.02  3840 0.81 |    3   36 1.11    5   35 1.00  3357 0.71 |    3   41 1.00    5    5 1.00  4639 0.98
mlt3x3  |    2   34   901 |    2   17 1.74    3   17 1.54   760 0.84 |    2   17 1.74    3   17 1.54   754 0.84 |    2   34 1.00    2   34 1.00   901 1.00
reedsol |    7    6  1231 |    5   11 1.04    7    9 0.90  1217 0.99 |    5   11 1.04    7    9 0.90  1217 0.99 |    5   11 1.04    7    9 0.90  1225 1.00
jpeg    |    6   13  5875 |    4    6 1.72    6    6 1.29  5059 0.86 |    4    6 1.72    6    6 1.29  4916 0.84 |    5   16 1.00    6   16 0.91  5887 1.00
dct     |    4    8  4767 |    2   11 1.65    3   10 1.47  4123 0.86 |    2   11 1.65    3   10 1.47  4059 0.85 |    3    8 1.00    4    8 1.00  4767 1.00
eth     |    6    5   258 |    4    9 1.62    6    6 1.42   242 0.94 |    4    9 1.62    6    6 1.42   242 0.94 |    4   20 1.06    6    8 1.00   267 1.03
usb     |    8    7  3111 |    5    8 1.35    6    8 1.19  3211 1.03 |    5    8 1.35    6    8 1.19  3186 1.02 |    5    8 1.35    6    8 1.19  3387 1.09
xtea    |    6   36  1009 |    4   30 1.29    6   29 1.15   900 0.89 |    4   30 1.29    6   29 1.15   910 0.90 |    5   32 1.13    5   32 1.13   974 0.97
des3    |    7    6   824 |    5    8 1.17    6    8 1.04   993 1.21 |    5    8 1.17    6    8 1.04   993 1.21 |    5    8 1.17    6    8 1.04  1002 1.22
rsa     |    6   38  1132 |    4   21 1.70    7   20 1.37   928 0.82 |    4   21 1.70    7   19 1.40   912 0.81 |    5   39 1.04    6   38 1.00  1135 1.00
md5     |   18   58  2569 |   12   52 1.41   21   26 1.11  2498 0.97 |   12   51 1.43   21   41 1.12  2465 0.96 |   15   51 1.06   16   75 1.01  2517 0.98
sha512  |    8   70  5518 |    6   68 1.09   10   65 0.99  4854 0.88 |    6   68 1.09    9   66 1.01  4828 0.87 |    8   70 1.00    8   70 1.00  5358 0.97
twofish |   50   49  2602 |   13   60 2.01   23   56 1.59  3100 1.19 |   13   60 2.01   23   56 1.59  3100 1.19 |   13   60 2.01   23   56 1.59  3100 1.19
ava     |   22   26 13415 |    8   24 1.92   11   19 1.77 11807 0.88 |    8   24 1.92   11   19 1.77 11989 0.89 |    8   24 1.92   11   19 1.77 12501 0.93
aes128  |   13   12 11939 |    7   16 1.38   11   13 1.11 12703 1.06 |    7   16 1.38   11   13 1.11 12703 1.06 |    7   16 1.38   11   13 1.11 12703 1.06
Total   |  167  374 59900 |   84  377    –  131  326    – 56235    – |   84  376    –  130  342    – 55631    – |   93  438    –  122  399    – 60363    –
Ratio   | 1.00 1.00  1.00 | 0.50 1.01 1.49 0.78 0.87 1.26  0.94 0.94 | 0.50 1.01 1.49 0.78 0.91 1.27  0.93 0.93 | 0.56 1.17 1.25 0.73 1.07 1.16  1.01 1.01
Table 3: Performance Summary for OpenCores Benchmarks, K=6
Design  | Normal          | Forget                                   | Before                                   | After
        |    G    L   LUT |   Go   Lo  SUo    G    L   SU   LUT    λ |   Go   Lo  SUo    G    L   SU   LUT    λ |   Go   Lo  SUo    G    L   SU   LUT    λ
cfft    |    4    6  4740 |    3   19 1.79    5   18 1.52  3162 0.67 |    3   20 1.72    5   19 1.47  3060 0.65 |    3   41 1.00    4    3 1.00  4620 0.97
mlt3x3  |    2   34   901 |    2   17 1.74    3   16 1.60   817 0.91 |    2   17 1.74    3   16 1.60   748 0.83 |    2   34 1.00    2   34 1.00   901 1.00
reedsol |    6    5  1212 |    5    6 1.10    5    6 1.10  1130 0.93 |    5    6 1.10    5    6 1.10  1130 0.93 |    5    6 1.10    5    6 1.10  1138 0.94
jpeg    |    5   14  5875 |    4    5 1.71    5    5 1.45  5248 0.89 |    4    5 1.71    5    5 1.45  4789 0.82 |    4   14 1.00    5   14 1.00  5877 1.00
dct     |    3    7  4766 |    2   10 1.75    3   10 1.47  4441 0.93 |    2   10 1.75    3   10 1.47  3993 0.84 |    3    7 1.00    3    7 1.00  4766 1.00
eth     |    6    5   255 |    4    7 1.79    5    6 1.62   221 0.87 |    4    7 1.79    5    6 1.62   218 0.85 |    4   20 1.06    5    8 1.06   245 0.96
usb     |    6    5  2815 |    4    7 1.21    6    6 0.96  2685 0.95 |    4    7 1.21    6    6 0.96  2662 0.95 |    4    7 1.05    6    6 0.96  2876 1.02
xtea    |    5   35   915 |    4   28 1.25    6   28 1.09   876 0.96 |    4   28 1.25    6   28 1.09   746 0.82 |    5   31 1.06    5   31 1.06   912 1.00
des3    |    4    3   347 |    4    3 1.00    4    3 1.00   338 0.97 |    4    3 1.00    4    3 1.00   338 0.97 |    4    3 1.00    4    3 1.00   347 1.00
rsa     |    6   38  1120 |    4   19 1.81    7   19 1.40   954 0.85 |    4   19 1.81    7   19 1.40   814 0.73 |    5   38 1.06    6   38 1.00  1127 1.01
md5     |   15   52  1730 |   11   44 1.47   20   44 1.09  2041 1.18 |   11   43 1.49   18   43 1.16  1945 1.12 |   13   73 1.01   15   72 0.97  2129 1.23
sha512  |    8   70  5362 |    6   68 1.09   11   15 0.96  4741 0.88 |    6   68 1.09    8   66 1.04  4492 0.84 |    8   70 1.00    8   70 1.00  5118 0.95
twofish |   40   39  2559 |   13   57 1.66   23   45 1.36  2797 1.09 |   13   57 1.66   23   45 1.36  2797 1.09 |   13   57 1.66   23   45 1.36  2797 1.09
ava     |   17   21 10394 |    8   19 1.67   10   18 1.50  9960 0.96 |    8   19 1.67   10   18 1.50 10269 0.99 |    8   19 1.67   10   18 1.50 10708 1.03
aes128  |    9    8  3921 |    6   12 1.17    8   13 0.95  4777 1.22 |    6   12 1.17    8   13 0.95  4777 1.22 |    6   12 1.17    8   13 0.95  4777 1.22
Total   |  136  342 46912 |   80  321    –  121  252    – 44188    – |   80  321    –  116  303    – 42778    – |   87  432    –  109  368    – 48338    –
Ratio   | 1.00 1.00  1.00 | 0.59 0.94 1.46 0.89 0.74 1.23  0.94 0.94 | 0.59 0.94 1.46 0.85 0.89 1.26  0.91 0.91 | 0.64 1.26 1.15 0.80 1.08 1.09  1.03 1.03
