Analysis and synthesis of weighted-sum functions by Sasao, Tsutomu
Kyushu Institute of Technology Academic Repository
Title: Analysis and synthesis of weighted-sum functions
Author(s): Sasao, Tsutomu
Issue Date: 2006-05
URL http://hdl.handle.net/10228/611
Rights
"©2006 IEEE. Personal use of this material is permitted.
However, permission to reprint/republish this material for
advertising or promotional purposes or for creating new
collective works for resale or redistribution to servers or lists,
or to reuse any copyrighted component of this work in other
works must be obtained from the IEEE."
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 25, NO. 5, MAY 2006 789
Short Papers
Analysis and Synthesis of Weighted-Sum Functions
Tsutomu Sasao
Abstract—A weighted-sum (WS) function computes the sum of selected
integers. This paper considers a design method for WS functions by
look-up table (LUT) cascades. In particular, it derives upper bounds on
the column multiplicities of decomposition charts for WS functions. From
these, the size of LUT cascades that realize WS functions can be estimated.
The arithmetic decomposition of a WS function is also shown. With this
method, a WS function can be implemented with cascades and adders.
Index Terms—Binary decision diagram, column multiplicity, complexity
of logic functions, digital filter, distributed arithmetic, field programmable
gate array (FPGA), functional decomposition, LUT cascades, radix converter,
symmetric function, threshold function.
I. INTRODUCTION
A weighted-sum (WS) function computes the sum of selected
integers, $\mathrm{WS}(x_0, x_1, \ldots, x_{n-1}) = \sum_{i=0}^{n-1} w_i x_i$, where $w_i$ are integer
weights and $x_i$ are binary variables. The WS function is a
mathematical model of various computations: bit counting circuits,
radix converters [10], distributed arithmetic (DA) for convolution
operation, etc.
The look-up table (LUT) cascade has a regular programmable
structure that implements many practical functions efficiently [7], [8].
In this paper, we derive upper bounds on the column multiplicities
of decomposition charts for WS functions. With these bounds, we can
estimate the size of a circuit for the consecutive outputs of the WS
function and efficiently realize WS functions with LUT cascades.
This paper is organized as follows. Section II defines WS functions
and shows the properties of WS functions. Section III shows the
method to implement WS functions with LUT cascades. Section IV
shows applications of WS functions. Section V shows the arithmetic
decomposition of the WS function. Section VI concludes this paper.
II. WS FUNCTIONS
A WS function is a mathematical model of bit counting circuits,
code converters, DA, etc.
Definition 2.1: An n-input WS function $F(X)$ computes
$$\mathrm{WS}(X) = \sum_{i=0}^{n-1} w_i x_i. \tag{2.1}$$
Here, $X = (x_0, x_1, \ldots, x_{n-1})$ is a binary input vector and $W = (w_0, w_1, \ldots, w_{n-1})$ is the weight vector, where $w_i$ ($i = 0, 1, \ldots, n-1$) is an integer. Let $F = (f_{q-1}, f_{q-2}, \ldots, f_0)$ be the binary representation of the WS function. Then
$$\mathrm{WS}(X) = \sum_{i=0}^{q-1} f_i(X)\, 2^i. \tag{2.2}$$
TABLE I
EXAMPLE OF WS FUNCTION
Manuscript received June 26, 2005; revised September 28, 2005. This work
was supported in part by the Grants in Aid for Scientific Research of Japan
Society for the Promotion of Science (JSPS) and Ministry of Education, Culture,
Sports, Science and Technology (MEXT) and by a Kitakyushu Innovative
Cluster Project grant. This paper was presented in part at the International
Workshop on Logic and Synthesis, 2005. This paper was recommended by
Guest Editor R. I. Bahar.
The author is with the Department of Computer Science and Electronics,
Kyushu Institute of Technology, Iizuka 820-8502, Japan (e-mail: sasao@cse.kyutech.ac.jp).
Digital Object Identifier 10.1109/TCAD.2006.870407
Example 2.1: Consider the case where n = 4 and W =
(w0, w1, w2, w3) = (1, 2, 3, 4). Table I shows X , WS( X), and F =
(f3, f2, f1, f0).
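The contents of Table I can be regenerated from Definition 2.1. The following is a minimal Python sketch; the helper names (`weighted_sum`, `output_bits`, `truth_table`) and the row order are assumptions made for illustration and are not taken from the paper.

```python
from itertools import product

def weighted_sum(weights, x):
    """Return WS(X) = sum of w_i * x_i for a binary input vector x."""
    return sum(w * xi for w, xi in zip(weights, x))

def output_bits(value, q):
    """Return (f_{q-1}, ..., f_0), the q-bit binary representation of value."""
    return tuple((value >> i) & 1 for i in reversed(range(q)))

def truth_table(weights, q):
    """List (X, WS(X), F) for every binary input vector X.

    The row order (x_0 varies slowest) is an assumption of this sketch.
    """
    rows = []
    for x in product((0, 1), repeat=len(weights)):
        ws = weighted_sum(weights, x)
        rows.append((x, ws, output_bits(ws, q)))
    return rows

if __name__ == "__main__":
    # Example 2.1: n = 4, W = (1, 2, 3, 4), four output bits.
    for x, ws, f in truth_table((1, 2, 3, 4), q=4):
        print(x, ws, f)
```

Four output bits suffice here because the largest value, 1 + 2 + 3 + 4 = 10, is below 16.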
Definition 2.2: Consider a function $F(X) : B^n \to B^q$, where
$B = \{0, 1\}$. Let $(X_L, X_H)$ be a partition of $X$, where $X_L = (x_0, x_1, \ldots, x_{n_L-1})$ and $X_H = (x_{n_L}, x_{n_L+1}, \ldots, x_{n-1})$. The decomposition chart for $F$ is a two-dimensional matrix where the
column labels have all possible assignments of values to variables in
$X_L$, the row labels have all possible assignments of values to variables
in $X_H$, and the corresponding matrix value is equal to $F(X_L, X_H)$.
Among the decomposition charts for $F$, the one whose column label
values and row label values increase when the label moves from left
to right and from top to bottom is the standard decomposition chart.
The number of different column patterns in the decomposition chart
is the column multiplicity. $X_L$ denotes bound variables, while $X_H$
denotes free variables [9].
Note that, in an ordinary decomposition chart, the partitions of
variables and the order of labels in the columns and rows are arbitrary.
However, in the standard decomposition chart, the labels of
the columns are in increasing order of $X_L = (x_0, x_1, \ldots, x_{n_L-1})$
and the labels of the rows are in increasing order of $X_H = (x_{n_L}, x_{n_L+1}, \ldots, x_{n-1})$.
Example 2.2: Table II shows an example of a decomposition chart
for n = 5, where XL = (x0, x1, x2) and XH = (x3, x4). Suppose
that q = 2, that is, only two least significant bits are considered. Note
that each element is a binary vector of 2 bits. In this case, only four
different vectors can exist. So, in the first row of the decomposition
chart, that is, the row for (x3, x4) = (0, 0), at least two elements
are equal. Suppose that the values for the columns (x0, x1, x2) =
(0, 1, 1) and (x0, x1, x2) = (1, 0, 0) are equal: w1 + w2 = w0.
TABLE II
DECOMPOSITION CHART FOR WS FUNCTION
This implies that in the second row of the decomposition chart, that is,
in the row for (x3, x4) = (0, 1), the corresponding two elements are
equal: w1 + w2 + w4 = w0 + w4. This is obvious since the same
numbers are added to both sides of the equation. In similar ways,
we can show that in the remaining rows, the entries for the columns
(x0, x1, x2) = (0, 1, 1) and (x0, x1, x2) = (1, 0, 0) are equal. That is,
if the two elements in the first row are equal, then the patterns of the
two columns are the same. Hence, we can see that the column patterns
for (x0, x1, x2) = (0, 1, 1) and (x0, x1, x2) = (1, 0, 0) are the same.
From the above example we have the following.
Lemma 2.1: The column multiplicity of a decomposition chart of
a WS function is equal to the number of different elements in the
first row.
Furthermore, we have the following.
Lemma 2.2: The column multiplicity of the decomposition chart for
a q-output WS function is at most $2^q$.
Proof: Let $F(x_0, x_1, \ldots, x_{n-1})$ be an n-input q-output WS
function. Let $(X_L, X_H)$ be a partition of $X = (x_0, x_1, \ldots, x_{n-1})$,
where $X_L = (x_0, x_1, \ldots, x_{n_L-1})$ and $X_H = (x_{n_L}, x_{n_L+1}, \ldots, x_{n-1})$. Consider the decomposition chart of $F$, where $X_L$ denotes
the bound variables and $X_H$ denotes the free variables.
When $n_L \le q$, there can be no more than $2^q$ columns, and the
lemma follows. Consider $n_L > q$. In the first row of the decomposition
chart, i.e., the row for $X_H = (0, 0, \ldots, 0)$, the number of different
elements is at most $2^q$ since each element of the decomposition chart
is a vector of q bits. Thus, there exist two different vectors $a, b \in \{0, 1\}^{n_L}$ such that $F(a, 0) = F(b, 0)$.
Next, consider the jth row ($j > 0$). Let $c$ be the value of $X_H = (x_{n_L}, x_{n_L+1}, \ldots, x_{n-1})$. Then, by Definition 2.1, $F$ satisfies
$$F(a, c) = F(a, 0) + F(0, c)$$
$$F(b, c) = F(b, 0) + F(0, c)$$
where the symbol + denotes the addition of binary vectors with
carry propagation. Therefore, we have the relation $F(a, c) = F(b, c)$.
Since this relation holds for all $j > 0$, the two column patterns that
correspond to vectors $a$ and $b$ are the same.
From the above, we can conclude that the column multiplicity of the
decomposition chart is at most $2^q$.
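Lemmas 2.1 and 2.2 can be checked by brute force on small examples. The sketch below builds the decomposition chart of the q least significant bits of a WS function (taken here as WS mod 2^q, an assumption of this illustration) and compares the number of distinct columns with the number of distinct entries in the first row. All function names are hypothetical.

```python
from itertools import product

def dot(weights, bits):
    """Inner product of a weight list and a 0/1 tuple."""
    return sum(w * b for w, b in zip(weights, bits))

def decomposition_chart(weights, n_l, q):
    """Chart of WS mod 2**q: one row per X_H assignment, one column per X_L assignment.

    The first row corresponds to X_H = (0, ..., 0). The label order inside
    rows and columns does not affect the column multiplicity.
    """
    w_l, w_h = weights[:n_l], weights[n_l:]
    columns = list(product((0, 1), repeat=len(w_l)))
    rows = list(product((0, 1), repeat=len(w_h)))
    return [[(dot(w_l, a) + dot(w_h, c)) % 2**q for a in columns] for c in rows]

def column_multiplicity(chart):
    """Number of distinct column patterns."""
    return len(set(zip(*chart)))

def distinct_in_first_row(chart):
    """Number of distinct entries in the row for X_H = (0, ..., 0)."""
    return len(set(chart[0]))

if __name__ == "__main__":
    chart = decomposition_chart((1, 2, 3, 4, 5), n_l=3, q=2)
    print(column_multiplicity(chart), distinct_in_first_row(chart))
```

Both counts agree for every partition tried, and neither exceeds 2^q, as the lemmas state.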
Theorem 2.1: Let F ( X) be a WS function. Let ( XL, XH) be a par-
tition of X = (x0, x1, . . . , xn−1), where XL = (x0, x1, . . . , xnL−1)
and XH = (xnL , xnL+1, . . . , xn−1). Consider the decomposition
chart of F , where XL denotes bound variables and XH denotes
free variables. Let W = (w0, w1, . . . , wn−1) be the weight vector.
Then, the column multiplicity of the decomposition chart is at most
UB1 = 1 +
∑nL−1
j=0
|wj |, where nL denotes the number of variables
in XL.
TABLE III
DECOMPOSITION CHART OF WS FUNCTION
(INTEGER REPRESENTATION)
Proof: Consider the decomposition chart for $\mathrm{WS}(X_L, X_H)$. In
the first row of the decomposition chart, $X_H = (0, 0, \ldots, 0)$. Note that
the column multiplicity is equal to the number of different values in the
first row.
Consider the case where all the weights are positive. In this case,
the number of different values is at most $\mathrm{UB}_1$, since WS takes values
from 0 to $\sum_{j=0}^{n_L-1} w_j$.
Consider the case where some of the weights are negative. Assume
that $w_0, w_1, \ldots, w_{t-1}$ are negative, and $w_t, w_{t+1}, \ldots, w_{n_L-1}$ are
positive. Then, the WS takes values from $\sum_{j=0}^{t-1} w_j$ to $\sum_{j=t}^{n_L-1} w_j$.
In this case, the number of different values is at most
$$1 + \sum_{j=0}^{t-1} |w_j| + \sum_{j=t}^{n_L-1} w_j = 1 + \sum_{j=0}^{n_L-1} |w_j|.$$
From these, we can conclude that the column multiplicity of the decomposition chart
is at most $\mathrm{UB}_1$.
Example 2.3: Consider the case where n = 5 and W =
(w0, w1, w2, w3, w4) = (1, 2, 3, 4, 5). Let XL = (x0, x1, x2) and
XH = (x3, x4). In this case, UB1 = 1 + w0 + w1 + w2 = 1 + 1 +
2 + 3 = 7. Table III shows the decomposition chart of the function.
Note that the column multiplicity of the decomposition chart is 7.
So, the bound UB1 is tight.
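The bound of Theorem 2.1 is easy to compare with the exact count. By Lemma 2.1, the column multiplicity in the integer representation equals the number of distinct subset sums of the bound-variable weights. A minimal sketch, with hypothetical helper names `ub1` and `subset_sums`:

```python
def ub1(weights, n_l):
    """Upper bound of Theorem 2.1: 1 + sum of |w_j| over the n_l bound variables."""
    return 1 + sum(abs(w) for w in weights[:n_l])

def subset_sums(weights):
    """Set of all values that sum(w_j * x_j) can take over binary x."""
    sums = {0}
    for w in weights:
        sums |= {s + w for s in sums}
    return sums

def first_row_multiplicity(weights, n_l):
    """Distinct values in the first row (X_H = 0) of the decomposition chart."""
    return len(subset_sums(weights[:n_l]))

if __name__ == "__main__":
    w = (1, 2, 3, 4, 5)          # Example 2.3
    print(ub1(w, 3), first_row_multiplicity(w, 3))
```

With widely spaced weights such as (1, 10), the exact count (4) is far below the bound (12), so the bound is tight only when the subset sums fill the whole range.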
A WS function usually has many outputs. When it is implemented
as a monolithic circuit, it can be very large. However, if we partition
the outputs into groups and implement each group separately, then
the whole circuit can be smaller. The next two theorems give upper
bounds on the column multiplicity for the block for the least significant
i bits (LSBLOCK) and the block for the most significant (q − i) bits
(MSBLOCK). These bounds estimate the sizes of component circuits.
Theorem 2.2: Let FLSB( X) be the logic function that represents
the least significant i bits of a WS function. Then, the column mul-
tiplicity of the standard decomposition chart for FLSB( X) is at most
UB2 = 2i.
Proof: Let FLSB( X) be the integer represented by the least
significant i bits of the function. Then, we have
FLSB( X) = WS( X) (mod 2i).
Since the column is computed in modulo 2i, we can omit the most
significant (q − i) bits and leave only the least significant i bits. From
Lemma 2.2, the number of different column patterns is at most 2i.
Hence, the column multiplicity of the standard decomposition chart is
at most 2i. 
TABLE IV
DECOMPOSITION CHART OF WS FUNCTION (BINARY REPRESENTATION).
(a) ALL FOUR BITS. (b) LEAST SIGNIFICANT TWO BITS.
(c) MOST SIGNIFICANT TWO BITS
Definition 2.3: Let $\alpha$ be a real number. The largest integer that is
not greater than $\alpha$ is denoted by $\lfloor \alpha \rfloor$, and the smallest integer that is
equal to or greater than $\alpha$ is denoted by $\lceil \alpha \rceil$.
Theorem 2.3: Let FMSB( X) be the function that represents the
ith to the most significant bits of a WS function. Then, the column
multiplicity of the standard decomposition chart for FMSB( X) is
at most
UB3 = n−1max
nL=1
[
min
{
2nL ,
(⌊∑nL−1
j=0
|wj |
2i
⌋
+ 1
)
2n−nL
}]
(2.3)
where W = (w0, w1, . . . , wn−1) is the weight vector. Here, the least
significant bit is the 0th bit.
Proof: Let XL = (x0, x1, . . . , xnL−1) be the bound variables,
and let XH = (xnL , xnL+1, . . . , xn−1) be the free variables of the
standard decomposition chart. Let nL be the number of bound vari-
ables and nH be the number of free variables. It is clear that column
multiplicity is at most 2nL , the total number of the columns. The
maximal number represented from the ith bit to the most significant
bit is p = ∑nL−1
j=0
|wj |/2i	. So, we can regard it as a (p+ 1) valued
function g : Bn → {0, 1, . . . , p}. Reorder the bound variables so that
moving from left to right in the decomposition chart will not decrease
the value of the function g. In this case, the number of changes of
the columns in a row is at most p+ 1. Since there are 2nH rows, the
column multiplicity is at most 2nH (p+ 1), where nH = n− nL.
Hence, we have the theorem. 
Example 2.4: Consider the case where $n = 5$ and $W = (w_0, w_1, w_2, w_3, w_4) = (1, 2, 3, 4, 5)$. Let $X_L = (x_0, x_1, x_2)$ and $X_H = (x_3, x_4)$. Table IV(a) shows the decomposition chart, where the
function values are represented by binary numbers. Table IV(b) is
the decomposition chart of the least significant 2 bits. The column
multiplicity is 4. Theorem 2.2 shows that the upper bound on the
column multiplicity is $\mathrm{UB}_2 = 2^2 = 4$. So, the bound
$\mathrm{UB}_2$ is tight.
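Table IV(c) is the decomposition chart of the most significant 2 bits.
The column multiplicity is 3. Theorem 2.3 shows that the upper bound
on the column multiplicity is
$$\begin{aligned}
\mathrm{UB}_3 &= \max_{n_L=1}^{4} \left[ \min \left( 2^{n_L},\ \left( \left\lfloor \frac{\sum_{j=0}^{n_L-1} w_j}{2^2} \right\rfloor + 1 \right) 2^{5-n_L} \right) \right] \\
&= \max \left[ \min \left( 2^1, \left( \left\lfloor \frac{w_0}{2^2} \right\rfloor + 1 \right) 2^4 \right),\ \min \left( 2^2, \left( \left\lfloor \frac{w_0 + w_1}{2^2} \right\rfloor + 1 \right) 2^3 \right), \right. \\
&\qquad\quad \left. \min \left( 2^3, \left( \left\lfloor \frac{w_0 + w_1 + w_2}{2^2} \right\rfloor + 1 \right) 2^2 \right),\ \min \left( 2^4, \left( \left\lfloor \frac{w_0 + w_1 + w_2 + w_3}{2^2} \right\rfloor + 1 \right) 2^1 \right) \right] \\
&= \max(2, 4, 8, 6) = 8.
\end{aligned}$$
Note that as shown in Table V, when $X_L = (x_0, x_1, x_2, x_3)$ and $X_H = (x_4)$, the decomposition chart has a maximal column multiplicity of 5.
In this case, the upper bound $\mathrm{UB}_3$ is not tight.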
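The numbers in Example 2.4 can be reproduced with a short script. The sketch below assumes nonnegative weights, so that the bits from the $i$th upward are simply WS shifted right by $i$; it also uses the observation that a column pattern depends only on the partial sum of the bound variables. The names `msb_column_multiplicity` and `ub3` are hypothetical.

```python
def subset_sums(weights):
    """Set of all values that sum(w_j * x_j) can take over binary x."""
    sums = {0}
    for w in weights:
        sums |= {s + w for s in sums}
    return sums

def msb_column_multiplicity(weights, n_l, i):
    """Distinct column patterns of floor(WS / 2**i) with x_0..x_{n_l-1} as bound variables.

    Assumes nonnegative weights. Columns with equal partial sums are identical,
    and rows with equal partial sums are duplicates, so sets of sums suffice.
    """
    col_sums = subset_sums(weights[:n_l])
    row_sums = sorted(subset_sums(weights[n_l:]))
    return len({tuple((v + c) >> i for c in row_sums) for v in col_sums})

def ub3(weights, i):
    """Upper bound (2.3) of Theorem 2.3."""
    n = len(weights)
    best = 0
    for n_l in range(1, n):
        p = sum(abs(w) for w in weights[:n_l]) // 2**i
        best = max(best, min(2**n_l, (p + 1) * 2**(n - n_l)))
    return best

if __name__ == "__main__":
    w = (1, 2, 3, 4, 5)
    print([msb_column_multiplicity(w, n_l, 2) for n_l in range(1, 5)], ub3(w, 2))
```

For these weights the largest actual multiplicity over all partitions is 5, while the bound is 8.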
III. LUT CASCADE
An arbitrary logic function can be implemented by a single memory.
However, as the number of input variables increases, the size of the
memory increases exponentially.
In general, practical functions often have decomposition charts with
small column multiplicities.
Theorem 3.1: For a given function $f$, let $X_L$ be the variables for
the columns, let $X_H$ be the variables for the rows, and let $\mu$ be the
column multiplicity of the decomposition chart. Then, the function $f$
is realizable with the network shown in Fig. 1. In this case, the number
of (two-valued) signal lines that connect two blocks $H$ and $G$ is
$\lceil \log_2 \mu \rceil$ [2].
When the number of signal lines that connect two blocks is smaller
than the number of variables in XL, we can often reduce the size of
memory to implement the function. This technique is a functional
decomposition.
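The decomposition of Theorem 3.1 can be sketched by assigning one code to each distinct column pattern. In the sketch below, block H is assumed to be the table fed by $X_L$ and block G the table fed by the code and $X_H$; the function names and the dictionary-based tables are illustrative choices, not the paper's implementation.

```python
from itertools import product

def decompose(f, n_l, n_h):
    """Split f(X_L, X_H) into tables H: X_L -> code and G: (code, X_H) -> value.

    Returns (h, g, mu, rails), where mu is the column multiplicity and
    rails = ceil(log2(mu)) is the number of lines between the two blocks.
    """
    rows = list(product((0, 1), repeat=n_h))
    code_of_pattern = {}          # column pattern -> integer code
    h, g = {}, {}
    for a in product((0, 1), repeat=n_l):
        pattern = tuple(f(a, c) for c in rows)
        code = code_of_pattern.setdefault(pattern, len(code_of_pattern))
        h[a] = code
        for c, value in zip(rows, pattern):
            g[(code, c)] = value
    mu = len(code_of_pattern)
    rails = (mu - 1).bit_length()  # equals ceil(log2(mu)) for mu >= 1
    return h, g, mu, rails

def make_ws(weights, q):
    """q least significant bits of a WS function, as f(X_L, X_H)."""
    def f(a, c):
        return sum(w * x for w, x in zip(weights, a + c)) % 2**q
    return f

if __name__ == "__main__":
    f = make_ws((1, 2, 3, 4, 5), q=2)
    h, g, mu, rails = decompose(f, n_l=3, n_h=2)
    print(mu, rails)
```

In this example three bound variables are replaced by two rails, which is the situation in which the decomposition reduces memory.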
By applying functional decomposition repeatedly to the given function,
we have the LUT cascade shown in Fig. 2.
The cascade consists of cells, and the wires connecting adjacent
cells are rails. Functions with small column multiplicities have compact
LUT cascade realizations. To derive column multiplicities, we
need not use decomposition charts. We can efficiently obtain column
multiplicity by a binary decision diagram (BDD_for_CF) that
represents the characteristic function for the multiple-output function [8], [11].
Theorem 3.2: Let $\mu$ be the maximum width of the BDD for the function
$f$. Then, $f$ can be implemented by the LUT cascade consisting of
cells with at most $\lceil \log_2 \mu \rceil + 1$ inputs and at most $\lceil \log_2 \mu \rceil$ outputs [7].
Corollary 3.1: Let FLSB( X) be the logic function that represents
the least significant q bits of a WS function. Then, FLSB( X) can be
realized with the LUT cascade consisting of cells with at most q + 1
inputs and at most q outputs.
Corollary 3.2: Let the number of outputs of a WS function be
q. Then, the WS function can be realized with the LUT cascade
consisting of cells with at most q + 1 inputs and at most q outputs.
Theorem 3.3: Consider an LUT cascade for a function $f$. Let $n$
be the number of primary inputs, $s$ be the number of cells, $r$ be the
maximum number of rails (i.e., the number of lines between cells),
$k$ be the maximum number of inputs of a cell, $\mu$ be the maximum
width of the BDD for $f$, and $k \ge \lceil \log_2 \mu \rceil + 1$. Then, there is an LUT
cascade for $f$ that satisfies the relation
$$s \le \left\lceil \frac{n - r}{k - r} \right\rceil. \tag{3.1}$$
TABLE V
DECOMPOSITION CHART OF MOST SIGNIFICANT TWO BITS OF WS FUNCTION
Fig. 1. Realization of logic functions by decomposition.
Fig. 2. LUT cascade with intermediate outputs.
Proof: From the design method of the LUT cascade, we have
$$k + (k - r)(s - 1) \le n.$$
Here, $k$ in the left-hand side of the inequality denotes the number
of inputs of the left-most LUT, and $(k - r)(s - 1)$ denotes the sum
of inputs for the remaining $(s - 1)$ LUTs. When the actual number of
rails is smaller than $r$, we append dummy rails to make the number of
rails $r$. From this, we have
$$s - 1 \le \frac{n - k}{k - r} \quad \text{and} \quad s \le \frac{n - r}{k - r}.$$
Since $s$ is an integer, we have (3.1). When this inequality holds,
we can realize an LUT cascade for $f$ having cells with at most $k$
inputs.
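Relation (3.1) can be evaluated directly. The sketch below also lists how many primary inputs each cell would receive under the assumption used in the proof, namely that every cell after the first spends exactly $r$ of its $k$ inputs on rails. The function names are hypothetical, and the sample values $n = 16$, $k = 11$, $r = 7$ are borrowed from the ternary-to-binary converter discussed later in the paper.

```python
def cells_upper_bound(n, k, r):
    """Right-hand side of (3.1): ceil((n - r) / (k - r)), at least one cell."""
    if k <= r:
        raise ValueError("each cell needs more inputs than rails (k > r)")
    return max(1, -(-(n - r) // (k - r)))

def primary_inputs_per_cell(n, k, r):
    """Primary inputs taken by each cell: k for the first, then k - r each.

    Assumes exactly r rails between all adjacent cells (dummy rails appended).
    """
    if k <= r:
        raise ValueError("each cell needs more inputs than rails (k > r)")
    allocation = [min(k, n)]
    remaining = n - allocation[0]
    while remaining > 0:
        take = min(k - r, remaining)
        allocation.append(take)
        remaining -= take
    return allocation

if __name__ == "__main__":
    print(cells_upper_bound(16, 11, 7), primary_inputs_per_cell(16, 11, 7))
```

The length of the allocation list equals the bound whenever $n \ge k$, which is the case of interest for cascades.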
IV. APPLICATIONS OF WS FUNCTIONS
In this part, we consider designs of bit counting circuits, ternary-
to-binary converters, decimal-to-binary converters, and finite impulse
response (FIR) filter. We also show the application to threshold
functions.
A. Bit Counting Circuit
The bit counting function WGT n [6] is the simplest example of an
n-input WS function. It counts the number of ones in the inputs and
represents it by a binary number.
Example 4.1: Assume that $n = 16$. Then, we have $W = (w_0, w_1, \ldots, w_{15}) = (1, 1, \ldots, 1)$. Let $F = (f_4, f_3, f_2, f_1, f_0)$ be the
outputs of the WS function, then we can show that [6]
$$f_4 = x_0 \cdot x_1 \cdots x_{15}$$
$$f_3 = \bigoplus_{i_1 < i_2 < \cdots < i_8} x_{i_1} \cdot x_{i_2} \cdot x_{i_3} \cdots x_{i_8}$$
$$f_2 = \bigoplus_{i_1 < i_2 < i_3 < i_4} x_{i_1} \cdot x_{i_2} \cdot x_{i_3} \cdot x_{i_4}$$
$$f_1 = \bigoplus_{i_1 < i_2} x_{i_1} \cdot x_{i_2}$$
$$f_0 = x_0 \oplus x_1 \oplus \cdots \oplus x_{15}$$
where $i_1, i_2, \ldots, i_8 \in \{0, 1, 2, \ldots, 15\}$. Since $n_L \le 15$, by
Theorem 2.1, we can see that the column multiplicity of the
decomposition chart is at most $\mathrm{UB}_1 = 1 + \sum_{j=0}^{14} 1 = 1 + 15 = 16$.
By Theorem 3.2, this function can be realized by a single cascade
with five-input four-output cells. If the outputs are partitioned into
$(f_1, f_0)$ and $(f_4, f_3, f_2)$ and realized by the LSBLOCK and the
MSBLOCK, respectively, then the column multiplicities for them are
4 and 14, respectively (see Table VII).
Fig. 3. Bit counting function WGT16.
Fig. 3 shows the cascades for WGT16, where each cell has at most
ten inputs. The upper cascade corresponds to LSBLOCK and realizes
the least significant 2 bits $(f_1, f_0)$. By Theorem 3.1, the number of
outputs for the first cell is two since $\lceil \log_2 4 \rceil = 2$. The lower cascade
corresponds to MSBLOCK and realizes the most significant 3 bits
$(f_4, f_3, f_2)$. By Theorem 3.1, the number of outputs for the first cell is
four since $\lceil \log_2 14 \rceil = 4$. For this function, we can obtain the cascade
structure from the number of inputs of cells.
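The output expressions of the bit counting function state that $f_k$ is the XOR of all products of $2^k$ distinct inputs. When exactly $m$ inputs are 1, the number of such products equal to 1 is the binomial coefficient $\binom{m}{2^k}$, so $f_k$ is its parity. The following sketch checks this against the binary representation of the count; the function names are hypothetical, and the brute-force part is limited to a small $n$ to keep the run time short.

```python
from itertools import combinations, product
from math import comb

def xor_of_products(x, m):
    """XOR over all products of m distinct inputs of the 0/1 tuple x."""
    acc = 0
    for idx in combinations(range(len(x)), m):
        acc ^= all(x[i] for i in idx)
    return acc

def bit_by_parity(ones, k):
    """Bit k of the count, predicted as the parity of C(ones, 2**k)."""
    return comb(ones, 2**k) % 2

def formulas_hold(n):
    """Brute-force check, for a small n, of every output bit of WGT_n."""
    q = n.bit_length()
    for x in product((0, 1), repeat=n):
        ones = sum(x)
        for k in range(q):
            if xor_of_products(x, 2**k) != (ones >> k) & 1:
                return False
    return True

if __name__ == "__main__":
    print(formulas_hold(6))
    print(all(bit_by_parity(m, k) == (m >> k) & 1
              for m in range(17) for k in range(5)))
```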
B. Ternary-to-Binary Converter
Let F = (fq−1, fq−2, . . . , f0) be the output of a ternary-to-binary
converter. Then, in general, fi depends on all the inputs xj (j =
0, 1, . . . , n− 1). For ternary-to-binary converters, we use the binary-
coded ternary code to represent a ternary digit. That is, zero is
represented by (00); one is represented by (01); and two is represented
by (10). (11) is an unused code. In the decomposition chart, the
input variables are grouped into pairs. The truth table of the two-
digit ternary to a 4-bit binary converter is shown in Table VI. In
this case, (11) is an undefined input, and the corresponding outputs
are don’t cares.
TABLE VI
TRUTH TABLE FOR A TERNARY-TO-BINARY CONVERTER
Fig. 4. Eight-digit ternary-to-binary converter.
In Table VI, the binary-coded ternary representation is denoted by $X = (x_0, x_1, x_2, x_3)$, the ternary representation is
denoted by $T = (t_1, t_0)$, and the binary representation is denoted by
$F = (f_3, f_2, f_1, f_0)$. When we implement this converter by a WS
function, the weight vector is W = (w0, w1, w2, w3) = (1, 2, 3, 6).
In this case, the function is completely specified. For example, for
the input (x0, x1, x2, x3) = (1, 1, 1, 1), the output is (1, 1, 0, 0) since
WS = 1 + 2 + 3 + 6 = 12.
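The weight vector for a ternary-to-binary converter follows a simple pattern: digit $d$ contributes the pair $(3^d, 2 \cdot 3^d)$. The sketch below generates it and checks that the WS function returns the encoded value for all valid inputs. The helper names and the bit order inside a digit (low bit first) are assumptions chosen to match $W = (1, 2, 3, 6)$.

```python
def ternary_weights(digits):
    """Weights (1, 2, 3, 6, 9, 18, ...) for a binary-coded ternary input."""
    weights = []
    for d in range(digits):
        weights += [3**d, 2 * 3**d]
    return weights

def encode_ternary(value, digits):
    """Binary-coded ternary, least significant digit first, low bit first.

    Digit 0 -> (0, 0), digit 1 -> (1, 0), digit 2 -> (0, 1) in (low, high) order,
    i.e. the codes 00, 01, and 10 of the text; the code 11 is never produced.
    """
    bits = []
    for _ in range(digits):
        value, t = divmod(value, 3)
        bits += [t & 1, t >> 1]
    return bits

def weighted_sum(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))

if __name__ == "__main__":
    w = ternary_weights(2)
    print(w, weighted_sum(w, encode_ternary(7, 2)))
```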
Example 4.2: Consider an eight-digit ternary-to-binary converter.
Since a ternary digit requires 2 bits, the total number of inputs is
$2 \times 8 = 16$. Further, the number of output bits is 13. To implement
the converter by a WS function, the weight vector should be $W = (1, 2, 3, 6, 9, 18, 27, 54, 81, 162, 243, 486, 729, 1458, 2187, 4374)$.
The column multiplicity of this function is bounded above by
$1 + \sum_{i=0}^{14} w_i = 5467$. This suggests that the function is unsuitable
for a single cascade realization. So we will implement it by a
pair of cascades. Assuming that we use cells with 11 inputs, we
have the cascade realization shown in Fig. 4. The upper cascade
LSBLOCK realizes the least significant seven bits, while the lower
cascade MSBLOCK realizes the most significant seven bits. From
Theorem 2.2, the column multiplicity of the decomposition chart for
the LSBLOCK is at most $2^7 = 128$. Thus, the number of rails for
the LSBLOCK is $\lceil \log_2 128 \rceil = 7$. From Theorem 2.3, the column
multiplicity of the decomposition chart for the MSBLOCK is at most
128. Thus, the number of rails is $\lceil \log_2 128 \rceil = 7$.
From Fig. 4, we can see that the amount of memory needed for
the cascades is $7(2^{11} + 2^{11} + 2^8 + 2^{11} + 2^{11} + 2^8) = 32\,256$ (bits),
which is much smaller than the single memory realization. Note that
the most significant bit, i.e., 14th bit, is not used for valid inputs and
can be omitted. The single memory realization requires $2^{16} \times 13 = 851\,968$ (bits).
C. Decimal-to-Binary Converter
In this part, we consider the design of various decimal-to-binary
converters.
Example 4.3: Consider a five-digit decimal to binary converter.
When the decimal numbers are represented by the 8421 BCD code,
the number of binary inputs is 4× 5 = 20.
Suppose that we realize it by the WS function with the weight vector
$W = (1, 2, 4, 8, 10, 20, 40, 80, 100, 200, 400, 800, 1000, 2000, 4000, 8000, 10\,000, 20\,000, 40\,000, 80\,000)$. We use two LUT cascades to
implement the function: the LSBLOCK realizes the least significant
nine bits and the MSBLOCK realizes the most significant nine bits.
From Theorem 2.2, we can see that the column multiplicity for the
LSBLOCK is at most $2^9 = 512$. From Theorem 2.3, we can see
that the column multiplicity for the MSBLOCK is at most 1024.
So, we can implement these blocks by using a cascade with cells of
at most 11 inputs. With 12-input cells, we can implement the WS
function consisting of a pair of cascades as shown in Fig. 5. As shown
in Table VII, the actual column multiplicity for the MSBLOCK is
521. So, the bound $\mathrm{UB}_3$ is not tight. However, it is still useful since
$\lceil \log_2 1024 \rceil = \lceil \log_2 521 \rceil = 10$.
Fig. 5. Five-digit decimal-to-binary converter (natural ordering).
TABLE VII
UPPER BOUNDS AND ACTUAL NUMBERS OF COLUMN MULTIPLICITIES
Fig. 6. Five-digit decimal-to-binary converter (optimal ordering).
In the case of the decimal-to-binary converter, some outputs depend
on only a part of the inputs. Especially, f0 = x0. That is, the least
significant bit depends on only x0. Also, the MSBLOCK does not
depend on x0. When we change the ordering of the inputs and outputs,
we have smaller cascades shown in Fig. 6. Note that in the LSBLOCK,
three outputs {f3, f2, f1} depend on only 12 inputs.
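The BCD weight vector and the dependence of the outputs on $x_0$ can be checked with a few lines. This is a sketch under stated assumptions: the helper names are hypothetical, digits are listed least significant first, and the `code` parameter (the per-digit bit weights) is an illustrative generalization.

```python
def digit_code_weights(digits, code=(1, 2, 4, 8)):
    """Weights of a WS function for a decimal input coded digit by digit.

    code holds the bit weights inside one digit; (1, 2, 4, 8) is 8421 BCD.
    """
    return [c * 10**d for d in range(digits) for c in code]

def encode_bcd(value, digits):
    """8421 BCD, least significant digit first, low bit first."""
    bits = []
    for _ in range(digits):
        value, d = divmod(value, 10)
        bits += [(d >> b) & 1 for b in range(4)]
    return bits

def weighted_sum(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))

if __name__ == "__main__":
    w = digit_code_weights(5)
    x = encode_bcd(12345, 5)
    print(weighted_sum(w, x))
```

Because every weight except $w_0 = 1$ is even, flipping $x_0$ changes only the least significant output bit, which is why $f_0 = x_0$ and the upper bits do not depend on $x_0$.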
Example 4.4: Table VIII shows the 5211, 2421, and 84-2-1 codes,
where the 9’s complements are easily obtained. Similar to the 8421
code, we can design converters for 5211, 2421, and 84-2-1 codes.
TABLE VIII
VARIOUS CODES FOR DECIMAL-TO-BINARY CONVERTERS
To design these cascades, we used the following weights for the 5211, 2421, and 84-2-1 codes, respectively:
$W = (1, 1, 2, 5, 10, 10, 20, 50, 100, 100, 200, 500, 1000, 1000, 2000, 5000, 10\,000, 10\,000, 20\,000, 50\,000)$,
$W = (1, 2, 2, 4, 10, 20, 20, 40, 100, 200, 200, 400, 1000, 2000, 2000, 4000, 10\,000, 20\,000, 20\,000, 40\,000)$,
$W = (-1, -2, 4, 8, -10, -20, 40, 80, -100, -200, 400, 800, -1000, -2000, 4000, 8000, -10\,000, -20\,000, 40\,000, 80\,000)$.
Again, we use two modules to implement code converters, namely,
1) the LSBLOCK realizes the least significant nine bits and 2) the
MSBLOCK realizes the most significant nine bits. Table VII shows
the upper bounds obtained from Theorems 2.2 and 2.3 and the actual
numbers for the column multiplicities. Note that the ordering of the
variables is fixed to $(x_0, x_1, \ldots, x_{n-1})$. For the LSBLOCKs, if we
reorder the variables, the column multiplicities are greatly reduced.
D. FIR Filter
Digital filters are important elements in signal processing [4] and
can be classified into two types, namely, FIR filters and infinite impulse
response (IIR) filters. FIR filters implement nonrecursive structures
and so always have stable operations. Also, FIR filters can have linear
phase characteristics, so they are useful for waveform transmission.
To realize FIR filters, we can use DA to convert the multiply
accumulation operations into table-lookup operations [3], [14]. In this
part, we consider an implementation of the DA of the FIR filter by
an LUT cascade. The LUT cascade realization requires much smaller
memory than the single memory realization. The structure of the FIR
filter mainly depends on the number of taps N , the number of bits
in the outputs q, and the number of inputs k of the cells in the LUT
cascade.
Definition 4.1: The FIR filter computes
$$\mathcal{Y}(n) = \sum_{i=0}^{N-1} w_i\, \mathcal{X}(n - i) \tag{4.1}$$
where $\mathcal{X}(i)$ is the value of the input $\mathcal{X}$ at the time $i$ and $\mathcal{Y}(i)$ is
the value of the output $\mathcal{Y}$ at the time $i$.¹ $w_i$ is a filter coefficient
represented by a q-bit binary number, and $N$ is the number of taps
in the filter.²
Fig. 7 implements (4.1) directly. It consists of an N -stage q-bit
shift register, N copies of q-bit multipliers, and an adder for
N q-bit numbers. To reduce the amount of hardware in Fig. 7, we use
the bit serial method shown in Fig. 8, where PSC denotes the parallel
to series converter and ACC denotes the shifting accumulator, which
accumulates the numbers while doing shifting operations.
¹ $\mathcal{X}$ and $\mathcal{Y}$ denote the values of signals in the filters. $x_i$ denotes a logic variable,
and $X_1$ and $X_2$ denote the vectors of logic variables.
² In general, the number of bits for $w_i$ and $\mathcal{Y}$ can be different. However, for
simplicity, we assume that they are represented by q bits.
Fig. 7. Parallel realization of FIR filter.
Fig. 8. Serial realization of FIR filter.
Fig. 9. Single memory realization of FIR filter.
In this case, the inputs to $w_0, w_1, \ldots, w_{N-1}$ are either 0 or 1, and
the multipliers are replaced by AND gates. The combinational part
in Fig. 8 has $N$ inputs and $(\lceil \log_2 N \rceil + q)$ outputs. In Fig. 9, the
combinational part is implemented by the memory that realizes the
WS function
$$\mathrm{WS}(x_0, x_1, \ldots, x_{N-1}) = \sum_{j=0}^{N-1} w_j x_j.$$
This method of computation is known as DA and is often used to
implement convolution operations, since many multipliers and an
adder with many inputs can be replaced by one memory [3], [14].
It is applicable only when the coefficients $w_i$ are constants. In FIR
filters, the coefficients $w_i$ are constants, so we can apply this method. It
reduces the amount of hardware to 1/q of the original but increases the computation
time by a factor of q.
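The DA computation described above can be sketched in software. In this sketch the memory is a list indexed by the current bit slice of the samples, and each lookup result is shifted by the bit position before accumulation. It assumes unsigned q-bit samples; two's-complement inputs would need a separate treatment of the sign bit, which is not shown. The function names are hypothetical.

```python
def build_da_table(coeffs):
    """Memory contents of the WS function: address bit j selects coefficient w_j."""
    n = len(coeffs)
    return [sum(w for j, w in enumerate(coeffs) if (addr >> j) & 1)
            for addr in range(2**n)]

def da_inner_product(coeffs, samples, q):
    """Compute sum_j w_j * X_j bit-serially with one table lookup per bit.

    samples are unsigned q-bit integers (an assumption of this sketch).
    """
    table = build_da_table(coeffs)
    acc = 0
    for b in range(q):                       # process bit b of every sample
        addr = sum(((x >> b) & 1) << j for j, x in enumerate(samples))
        acc += table[addr] << b              # weight the lookup by 2**b
    return acc

if __name__ == "__main__":
    coeffs = [3, -5, 7, 2]
    samples = [10, 3, 15, 0]
    print(da_inner_product(coeffs, samples, q=4),
          sum(w * x for w, x in zip(coeffs, samples)))
```

The loop makes q lookups instead of using N multipliers, which mirrors the trade-off between hardware and computation time.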
Example 4.5: Consider a low-pass FIR filter with 33 taps. Suppose
that it is symmetric, so we need only to realize the WS function
with 17 inputs [4]. Let the number of output bits be 15 and let
the filter coefficients be $W = (378, 188, -521, -1120, -713, 353, 614, -420, -1168, -100, 1538, 920, -1925, -2720, 2167, 10\,164, 14\,125)$. A single memory realization requires $2^{17} \cdot 15 = 1\,966\,080$ bits. Fig. 10 shows the LUT cascades for the filter,
where the LSBLOCK realizes the least significant eight bits and
the MSBLOCK realizes the most significant seven bits. The bounds
obtained from Theorems 2.2 and 2.3 are shown in Table VII. In this
case, the ordering of the input and output variables is optimized.
In particular, the LSBLOCK is reduced drastically since two outputs
depend on only 12 variables. The total amount of memory is
$2^{12} \times 8 + 2^{11} \times 6 + 2^{12} \times 10 + 2^{12} \times 9 + 2^{12} \times 7 = 110\,592$ bits.
Fig. 10. FIR filter (optimal ordering).
E. Threshold Function
Definition 4.2: A threshold function $f(x_0, x_1, \ldots, x_{n-1})$ satisfies
the relation $f = 1$ if $\sum_{i=0}^{n-1} w_i x_i \ge T$, and $f = 0$ otherwise, where
$(w_0, w_1, \ldots, w_{n-1})$ are weights and $T$ is the threshold.
Although a threshold function is not a WS function, we can estimate
the column multiplicity of a threshold function from the theory of WS
functions.
Theorem 4.1: The column multiplicity of a decomposition chart of
the threshold function with weights $(w_0, w_1, \ldots, w_{n-1})$ is at most
$$\mathrm{UB}_4 = 1 + \sum_{i=0}^{n-1} |w_i|. \tag{4.2}$$
Proof: The column multiplicity of a decomposition chart for $f$ is
not greater than that of the WS function having the same weights. By
Theorem 2.1, the column multiplicity of the WS function is at most
$\mathrm{UB}_4$. Hence, we have the theorem.
Threshold functions are useful for neural nets. So, we can see that
the LUT cascade is promising for neural nets when the sums of weights
are small.
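The argument in the proof of Theorem 4.1 can be illustrated numerically: thresholding can only merge columns of the WS chart, never split them. The sketch below counts the distinct columns of a threshold function and of the WS function with the same weights; the helper names are hypothetical, and the example weights and threshold are chosen only for illustration.

```python
def subset_sums(weights):
    """Set of all values that sum(w_j * x_j) can take over binary x."""
    sums = {0}
    for w in weights:
        sums |= {s + w for s in sums}
    return sums

def threshold_column_multiplicity(weights, n_l, t):
    """Distinct columns of f = [WS >= t] with x_0..x_{n_l-1} as bound variables."""
    col_sums = subset_sums(weights[:n_l])
    row_sums = sorted(subset_sums(weights[n_l:]))
    return len({tuple(int(v + c >= t) for c in row_sums) for v in col_sums})

def ws_column_multiplicity(weights, n_l):
    """Distinct columns of the WS function itself (integer representation)."""
    return len(subset_sums(weights[:n_l]))

def ub4(weights):
    """Upper bound (4.2)."""
    return 1 + sum(abs(w) for w in weights)

if __name__ == "__main__":
    w = (1, 2, 3, 4, 5)
    print(threshold_column_multiplicity(w, 3, t=8),
          ws_column_multiplicity(w, 3), ub4(w))
```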
V. ARITHMETIC DECOMPOSITION OF WS FUNCTIONS
In general, a q-output WS function requires (q + 1)-input q-output
cells in a cascade realization. Thus, when q is large, large cells
are required. To implement a WS function with many outputs using
smaller cells, we can decompose the WS function into smaller ones.
Note that this is different from the partition of outputs, where each
group of outputs is implemented by a cascade independently.
A 2q-output WS function can be decomposed into a pair of WS
functions as follows. Let $w_i$ be a weight of the 2q-output WS function.
Then, $w_i$ can be written as
$$w_i = 2^q w_{Ai} + w_{Bi}$$
where $w_{Ai}$ denotes the most significant q bits and $w_{Bi}$ denotes the
least significant q bits. In this case, we can implement the 2q-output
WS function by using a pair of WS functions and an adder, as shown
in Fig. 11. Note that the adder has 2q inputs and q outputs.
Fig. 11. Arithmetic decomposition of 2q output WS function.
Theorem 5.1: A 2q-output WS function $F(X)$ that represents
$$\sum_{i=0}^{N-1} w_i x_i$$
can be decomposed into a pair of WS functions $F_A(X)$ and $F_B(X)$,
where $F_A(X)$ is a q-output WS function representing
$$\sum_{i=0}^{N-1} w_{Ai} x_i$$
and $F_B(X)$ is a $(q + \lceil \log_2 N \rceil)$-output WS function representing
$$\sum_{i=0}^{N-1} w_{Bi} x_i$$
and
$$w_i = 2^q w_{Ai} + w_{Bi}.$$
This is an arithmetic decomposition of a WS function.
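The decomposition of Theorem 5.1 amounts to splitting each weight into its upper and lower q bits and recombining the two partial sums with a shift and an addition. The following sketch assumes nonnegative weights; the function names are hypothetical, and the sample weights are the eight-digit ternary-to-binary weights from Example 4.2 with $q = 7$.

```python
from math import ceil, log2

def split_weights(weights, q):
    """Return (w_A, w_B) with w_i = 2**q * w_Ai + w_Bi and 0 <= w_Bi < 2**q."""
    mask = (1 << q) - 1
    return [w >> q for w in weights], [w & mask for w in weights]

def weighted_sum(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))

def decomposed_sum(weights, x, q):
    """Evaluate WS through the two smaller WS functions and one adder."""
    w_a, w_b = split_weights(weights, q)
    return (weighted_sum(w_a, x) << q) + weighted_sum(w_b, x)

if __name__ == "__main__":
    w = [1, 2, 3, 6, 9, 18, 27, 54, 81, 162, 243, 486, 729, 1458, 2187, 4374]
    x = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    print(decomposed_sum(w, x, q=7), weighted_sum(w, x))
```

The lower partial sum is at most $N(2^q - 1)$, which is why $F_B$ needs $q + \lceil \log_2 N \rceil$ output bits.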
In a similar way, a 4q-output WS function can be decomposed into
four WS functions as follows. Let $w_i$ be a weight of the 4q-output WS
function. Then, $w_i$ can be written as
$$w_i = 2^{3q} w_{Ai} + 2^{2q} w_{Bi} + 2^q w_{Ci} + w_{Di}$$
where $w_{Ai}$, $w_{Bi}$, $w_{Ci}$, and $w_{Di}$ denote q-bit numbers. As shown in
Fig. 12, we realize the 4q-output WS function by using four q-output
WS functions and adders. Note that block A realizes a q-output WS
function, while blocks B, C, and D realize $(q + \lceil \log_2 N \rceil)$-output WS
functions.
Note that the output adder has 4q inputs and 2q outputs.
By applying the arithmetic decomposition iteratively, we can implement
any WS function with small cascades. We applied this method
to FIR filters and implemented them on field-programmable gate arrays
(FPGAs). Note that recent FPGAs have embedded RAMs [1], [15] and
we can use these RAMs as cells of the LUT cascades [13].
VI. CONCLUSION AND COMMENT
In this paper, we first defined WS functions as a mathematical
model of bit counting circuits, radix converters, and DA. Then, we
derived upper bounds on the column multiplicity for the standard
decomposition chart for a WS function.
Fig. 12. Arithmetic decomposition of 4q-output WS function.
WS functions with small weights have decomposition charts with
small column multiplicity. Thus, they can be efficiently implemented
by an LUT cascade. Since column multiplicity is equal to the width of
the BDD [5], a WS function with small weights has a small BDD. The
results of this paper imply that a neural net with small weights can be
efficiently implemented by LUT cascades.
ACKNOWLEDGMENT
The author would like to thank Prof. Iguchi for very helpful discussions
and Prof. J. T. Butler for discussions that improved the English
presentation.
REFERENCES
[1] Altera Cyclone FPGA. [Online]. Available: http://www.altera.com/
[2] H. A. Curtis, A New Approach to the Design of Switching Circuits.
Princeton, NJ: Van Nostrand, 1962.
[3] L. Mintzer, “FIR filters with field-programmable gate arrays,” J. VLSI
Signal Process., vol. 6, no. 2, pp. 120–127, Aug. 1993.
[4] K. K. Parhi, VLSI Digital Signal Processing Systems Design and Implementation.
New York: Wiley, 1999.
[5] T. Sasao, “FPGA design by generalized functional decomposition,” in
Logic Synthesis and Optimization, T. Sasao, Ed. Norwell, MA: Kluwer,
1993, pp. 233–258.
[6] ——, Switching Theory for Logic Synthesis. Norwell, MA: Kluwer,
1999.
[7] T. Sasao, M. Matsuura and Y. Iguchi, “A cascade realization of multiple-
output function for reconfigurable hardware,” in Proc. IWLS, Lake Tahoe,
CA, Jun. 12–15, 2001, pp. 225–230.
[8] T. Sasao and M. Matsuura, “A method to decompose multiple-output logic
functions,” in Proc. Design Automation Conf., San Diego, CA, Jun. 2–6,
2004, pp. 428–433.
[9] T. Sasao, J. T. Butler, and M. Riedel, “Application of LUT cascades to numerical
function generators,” in Proc. 12th SASIMI Workshop, Kanazawa,
Japan, Oct. 18–19, 2004, pp. 422–429.
[10] T. Sasao, “Radix converters: Complexity and implementation by LUT
cascades,” in Proc. Int. Symp. Multiple-Valued Logic, Calgary, AB,
Canada, May 18–21, 2005, pp. 256–263.
[11] T. Sasao and M. Matsuura, “BDD representation for incompletely
specified multiple-output logic functions and its applications to functional
decomposition,” in Proc. Design Automation Conf., San Diego, CA,
Jun. 2005, pp. 373–378.
[12] T. Sasao, “Analysis and synthesis of weighted-sum functions,” in Proc.
Int. Workshop Logic Synthesis, Lake Arrowhead, CA, Jun. 8–10, 2005,
pp. 455–462.
[13] T. Sasao, Y. Iguchi, and T. Suzuki, “On LUT cascade realizations of FIR
filters,” in Proc. 8th Euromicro Conf. DSD, Porto, Portugal, Aug. 30–Sep.
3 2005, pp. 467–474.
[14] S. A. White, “Applications of distributed arithmetic to digital signal
processing: A tutorial review,” IEEE ASSP Mag., vol. 6, no. 3, pp. 4–19,
Jul. 1989.
[15] XILINX Spartan FPGA. [Online]. Available: http://www.xilinx.com/
