High-Performance Field Programmable VLSI Processor Based on a Direct
Allocation of a Control/Data Flow Graph
Naotaka Ohsawa, Masanori Hariyama, Michitaka Kameyama
Graduate School of Information Sciences
Tohoku University
Aoba05, Aramaki, Aoba, Sendai, 980-8579, Japan
ohsawa@kameyama.ecei.tohoku.ac.jp
Abstract
As a cost-effective approach to developing special-purpose
processors, field programmable gate arrays (FPGAs) are widely
used. However, their major disadvantage is low performance
caused by the large delays of their programmable interconnection
networks. This paper proposes a high-performance field
programmable VLSI processor (FPVLSI). A bit-serial processing
element (PE) array is presented to reduce the complexity of the
programmable interconnection network, so that the area and the
delay of a switch block in the network are greatly reduced.
Moreover, a direct allocation of a control/data flow graph is
employed, in which only a single node is mapped onto a PE, so
that the wiring complexity is greatly reduced. An FPVLSI with
4400 PEs is designed in a 0.35 µm CMOS process. Its performance
is evaluated to be 28 times higher than that of a typical FPGA
when executing a 16-point FFT.
1 Introduction
To realize real-world intelligent systems such as highly safe
intelligent vehicles, a quick response to a dynamically changing
environment is required [1, 2]. Since such a requirement exceeds
the computing power of state-of-the-art general-purpose
processors, the development of special-purpose processors
becomes important.
There are two main approaches for realizing a special-
purpose processor: application specific integrated circuits
(ASICs) and field programmable gate arrays (FPGAs).
ASICs allow high-speed processing, but they are both ex-
pensive and inflexible.
On the other hand, FPGAs are cost-effective and flexible, since
they consist of programmable logic blocks and programmable
switch blocks that allow design modifications after production
[3, 4]. The major disadvantage of FPGAs is their low
performance, for the following reasons.
• The area and delay of a switch block become large since a
switch block consists of many programmable switches.
• The time for data transfer between logic blocks becomes large
since data from one logic block usually traverses many switch
blocks to reach the other logic block.
To solve these problems, this paper proposes a high-
performance field programmable VLSI (FPVLSI) proces-
sor.
Firstly, an ultra-highly-parallel processing element (PE) array
based on a bit-serial architecture is proposed. A PE can
communicate only with adjacent PEs, so that the complexity of
the switch blocks is reduced. The bit-serial architecture also
reduces the complexity of the switch blocks. As a result,
compact and high-speed switch blocks are designed.
To realize an ultra-highly-parallel PE array, a compact
control/memory module for a bit-serial PE is presented. Based on
the regular data flow of bit-serial operations, the
control/memory module is realized using a single shift register.
This increases the number of PEs that fit in the FPVLSI.
Secondly, a direct allocation of a control/data flow graph
(CDFG) is proposed. In the direct allocation, only a single
operation in a CDFG is mapped onto a single PE, so that an input
of one PE is connected to only one output of another PE. As a
result, the complexity of the interconnection network between
PEs is reduced. That is, the number of switch blocks that data
must traverse is reduced.
For evaluation, a 16-point FFT for 1000 data sets is performed
on the FPVLSI. The performance of the FPVLSI is 28 times higher
than that of a typical FPGA.
2 PE array based on a bit-serial architecture
2.1 Structure of a typical FPGA and its problems
Figure 1 shows a structure of a typical FPGA.
Proceedings of the IEEE Computer Society Annual Symposium on VLSI (ISVLSI02) 
0-7695-1486-3/02 $17.00 © 2002 IEEE 
Authorized licensed use limited to: TOHOKU UNIVERSITY. Downloaded on February 8, 2010 at 21:15 from IEEE Xplore.  Restrictions apply. 
Figure 1. Block diagram of an FPGA. (Labels: L = logic block
containing an LUT with input and output; C, S = switch blocks
built from crosspoint switches and configuration memories.)
Functional elements are logic blocks, denoted by L, that
are capable of implementing logic functions of a few vari-
ables.
The interconnection wires are provided in horizontal
and vertical channels. Connections are made using pro-
grammable switch blocks. In the structure depicted in Fig.
1 there are two types of switch blocks. One type, denoted
by C, is used to provide connection between a logic block
and the wires in a channel. The second type, denoted by S,
contains switches that provide connection between wires in
crossing channels. The number of switches in each block
has a direct impact on the flexibility of routing, cost, and
propagation delays.
There are two problems in typical FPGAs.
The first problem is a large propagation delay along a data
path. Many programmable switches in a switch block improve the
flexibility, but the area and delay of the switch block
increase. Moreover, data from a logic block may traverse many
switch blocks, so that the problem of area and delay becomes
more serious.
The second problem is that the delay of a lookup table (LUT)
based logic block is larger than that of a gate-based ALU.
2.2 Overview of the FPVLSI
As shown in Fig. 2, the FPVLSI consists of processing elements
(PEs) and switch blocks between PEs. All the PEs are arranged in
a two-dimensional array and each PE is connected only to its
four neighboring PEs. Therefore, the number of programmable
switches in a switch block is smaller than in the FPGA. The
major consideration for the PE array is to find a mapping of
operations onto PEs that localizes communication between PEs.
For this purpose, a direct allocation of a CDFG is described in
Section 3.
A PE consists of an ALU and a memory/control module. The ALU is
realized as a hardwired circuit so that its area and delay are
smaller than those of the LUT in the FPGA.
Figure 2. PE array structure for an FPVLSI. (Labels: processing
elements (PE) and 1-bit switch blocks (SB); inside a PE, a 1-bit
ALU with FA, DFF and signals an, bn, cn, and a control/memory
module (CMM).)
Figure 3. Block diagram of a bit-serial PE. (a) Block diagram of
a bit-serial PE (labels: input, ALU, memory/control unit,
control signal, output). (b) Block diagram of a bit-serial ALU
(labels: inputs a, b, c; FA; DFF with output q; MUXes selecting
between logical operation and arithmetic operation; control
signal; output).
The FPVLSI is based on a bit-serial architecture for the
following reasons.
1. The complexity of a switch block is lower than in a
bit-parallel architecture. That is, the number of programmable
switches in the switch block is greatly reduced, which greatly
reduces the area and delay of the switch block.
2. A high bit-level utilization ratio of the ALU can be
achieved. The word length of a computation depends on the
application and is not predetermined, so a bit-parallel ALU may
have a low bit-level utilization ratio. In a bit-serial
architecture, by contrast, the bit-level utilization ratio does
not depend on the word length.
3. The area of a bit-serial ALU is considerably smaller than
that of a bit-parallel ALU. Therefore, a highly parallel PE
array can be constructed.
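The second reason can be illustrated with a small calculation. The following is only a sketch: the 32-bit ALU width and the function names are hypothetical and are not taken from the design described here.

```python
def bit_parallel_utilization(word_length, alu_width):
    """Fraction of the ALU bit slices that do useful work in one operation.

    alu_width is a hypothetical fixed width chosen at design time.
    """
    if word_length > alu_width:
        raise ValueError("word does not fit into the ALU in one operation")
    return word_length / alu_width


def bit_serial_utilization(word_length):
    """A 1-bit ALU is busy for word_length cycles and handles one bit per cycle."""
    useful_bit_cycles = word_length
    total_bit_cycles = word_length
    return useful_bit_cycles / total_bit_cycles


for n in (8, 16, 32):
    print(n, bit_parallel_utilization(n, 32), bit_serial_utilization(n))
```

With the assumed 32-bit ALU, 8-bit data use a quarter of the bit slices, whereas the bit-serial ratio stays at 1 for every word length.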
2.3 PE based on a bit-serial architecture
Figure 3(a) shows the detailed structure of the bit-serial
PE. The PE consists of an ALU and a control/memory unit.
Figure 4. Memory module based on a shift register.
(a) Conventional bit-serial memory module (labels: counter,
address decoder, SRAM cells as memory unit, address,
input/output). (b) Bit-serial memory module based on a shift
register (labels: D flip-flops as memory unit, MUX, input,
output).
Figure 5. Control unit based on a one-hot counter. (Labels:
one-hot counter built as a shift register of D flip-flops; input
"1" at reset; MUX selecting the nth DFF; output "1" after n
clock cycles.)
2.3.1 ALU
As shown in Figure 3(b), a bit-serial ALU consists of a full
adder, a D flip-flop and some logic gates. The full adder
performs arithmetic operations in a bit-serial manner. The D
flip-flop propagates the carry from the sum of the ith input
bits to the sum of the (i+1)th input bits. The logic gates are
used for the partial products in the multiplication process, the
2's complement of the subtrahend, and logical operations.
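The behavior described above can be sketched in software as follows. This is a behavioral model under stated assumptions, not the circuit of Figure 3(b): bits are processed LSB first, the function names are hypothetical, and subtraction is modeled by inverting the subtrahend and presetting the carry flip-flop to 1.

```python
def to_bits(value, n):
    """LSB-first list of the n low-order bits of value."""
    return [(value >> i) & 1 for i in range(n)]


def from_bits(bits):
    """Unsigned integer encoded by an LSB-first bit list."""
    return sum(bit << i for i, bit in enumerate(bits))


def bit_serial_add(a_bits, b_bits, subtract=False):
    """Process one bit pair per clock cycle with a full adder and a carry DFF."""
    carry = 1 if subtract else 0            # carry DFF state at the word boundary
    out = []
    for a, b in zip(a_bits, b_bits):
        if subtract:
            b ^= 1                          # invert subtrahend (2's complement)
        out.append(a ^ b ^ carry)           # full-adder sum output
        carry = (a & b) | (carry & (a ^ b)) # full-adder carry, latched by the DFF
    return out, carry


n = 8
total, _ = bit_serial_add(to_bits(11, n), to_bits(6, n))
diff, _ = bit_serial_add(to_bits(11, n), to_bits(6, n), subtract=True)
print(from_bits(total), from_bits(diff))  # 17 5
```

An n-bit addition therefore occupies the 1-bit ALU for n clock cycles.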
2.3.2 Memory/control unit
In a conventional bit-serial architecture, a counter is used to
control bit-by-bit memory access as shown in Fig. 4(a). Since
the counter is larger than the bit-serial ALU and the memory
module, it occupies most of the area of the PE. To reduce the
area of the PE, a memory module that consists of a shift
register is proposed based on the regularity of bit-serial
memory access. Figure 4(b) shows the block diagram of the memory
module. In a bit-serial architecture, bits in the memory module
never need to be accessed in random order. This regularity
enables us to use a shift register as a memory module. By using
the shift register, bit-by-bit memory access can be performed
without a counter.
Figure 6. Low utilization of PE due to the different delays of
operations. (a) CDFG (multiply followed by add); (b) Mapping
into PEs (Stage 1, Stage 2, PE, SB); (c) Time chart (word
length: n bit; data sets 1 to n; durations n and 2n; idle
periods).
Figure 7. Block diagram of a 2-bit bit-serial pipeline
multiplier. (Labels: two FA stages with cin, cout, ans; D
flip-flops D1 to D4; inputs an, b1, b2; constant 0.)
Moreover, the bit-serial ALU requires a counter to generate a
RESET signal that indicates the boundary between words. Figure 5
shows a one-hot counter consisting of a shift register. The
counter holds a "1" in the lowest-index bit at reset and shifts
it one bit toward the highest-index bit at each clock cycle, so
that a "1" is produced after n clock cycles. The ALU is
controlled using this output.
The same shift register structure can be used for both the
memory unit and the control unit, so that the area of the PE is
reduced in comparison with the conventional bit-serial
architecture.
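The two uses of a shift register described above can be sketched behaviorally. The class names are hypothetical, and the wrap-around of the "1" in the one-hot counter is an assumption made so that the word-boundary pulse repeats every n cycles.

```python
from collections import deque


class ShiftRegisterMemory:
    """n-bit store that is read and written strictly in sequence (no address)."""

    def __init__(self, n):
        self.cells = deque([0] * n, maxlen=n)

    def clock(self, bit_in):
        bit_out = self.cells[0]      # oldest bit leaves the register
        self.cells.append(bit_in)    # maxlen drops the oldest bit automatically
        return bit_out


class OneHotCounter:
    """Shift register holding a single 1; pulses once every n clock cycles."""

    def __init__(self, n):
        self.state = [1] + [0] * (n - 1)   # "1" in the lowest-index bit at reset

    def clock(self):
        out = self.state[-1]                    # 1 when the last stage holds the "1"
        self.state = [out] + self.state[:-1]    # assumed wrap-around to index 0
        return out


memory = ShiftRegisterMemory(4)
written = [memory.clock(bit) for bit in (1, 0, 1, 1)]   # register initially zero
read_back = [memory.clock(0) for _ in range(4)]
counter = OneHotCounter(4)
pulses = [counter.clock() for _ in range(8)]
print(read_back, pulses)  # [1, 0, 1, 1] [0, 0, 0, 1, 0, 0, 0, 1]
```

Neither model needs a binary counter or an address decoder: sequencing comes from the shift itself.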
2.4 Performance matching by exploiting bit-level
parallelism
In a conventional bit-serial architecture, the time for an
operation varies greatly with the operation type. For example,
consider pipelining a multiplication and an addition as shown in
Fig. 6(a). Assume that the multiplication and the addition are
each performed by one PE as shown in Fig. 6(b). Then the
multiplication takes longer than the addition, as shown in Fig.
6(c). This leads to a low utilization ratio of the PE that
performs the addition.
To solve this problem, we introduce performance matching.
Performance matching is based on the idea of exploiting
bit-level parallelism for an operation that needs more
computational power. As a result, the time for an operation that
needs more computational power can be made equal to the time for
an operation that needs less computational power.
In the FPVLSI, a bit-serial pipeline multiplier is used to match
the performance of an addition and a multiplication. This is
because the multiply-accumulate (MAC) operation is a fundamental
digital signal processing building block.
Figure 8. 2-bit bit-serial pipeline multiplier using PE array.
(Labels: two PEs connected through a switch block (SB); FA with
cin, cout, ans; D flip-flops D1 to D5; inputs an, bn; constant
0.)
Figure 7 shows a 2-bit bit-serial pipeline multiplier [5]. A
building block of the bit-serial pipeline multiplier consists of
a full adder, four D flip-flops and an AND gate. The AND gate
computes the partial products of the multiplication. Dn denotes
a D flip-flop. D1 propagates the carry from the sum of the ith
input bits to the sum of the (i+1)th input bits. D3 shifts the
multiplicand. D2 and D4 are registers that store intermediate
results. To implement the multiplier, several building blocks
are connected in series, one building block for each bit of the
numbers to be multiplied. In this multiplier, the sum of the
partial-product matrix is produced using the pipeline design
technique. As a result, the time needed to produce a product is
2n clock cycles, where n is the word length of the numbers to be
multiplied.
Figure 8 shows the block diagram of the bit-serial pipeline
multiplier implemented using the PE array. To implement the
multiplier, each PE needs three dedicated D flip-flops. D3 and
D4 correspond to D3 and D4 in Figure 7, respectively. D5 is used
as a serial-to-parallel converter because the multiplicand must
be presented at the same time. By using two bit-serial pipeline
multipliers concurrently, the time for a multiplication becomes
equal to the time for an addition.
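The effect of performance matching on cycle counts can be checked with a short calculation. This sketch assumes that a bit-serial addition of n-bit words takes n clock cycles and that concurrent multipliers work on different data sets; the function name is hypothetical.

```python
def cycles_per_result(latency_cycles, units):
    """Average number of clock cycles between results when `units`
    identical units operate concurrently on different data sets."""
    return latency_cycles / units


n = 16                                        # word length in bits
add_time = cycles_per_result(n, 1)            # one adder PE: n cycles (assumed)
mul_time_single = cycles_per_result(2 * n, 1) # one multiplier: 2n cycles
mul_time_matched = cycles_per_result(2 * n, 2)

# Without matching, the adder stage waits for the multiplier half of the time.
adder_utilization_unmatched = add_time / mul_time_single
adder_utilization_matched = add_time / mul_time_matched
print(adder_utilization_unmatched, adder_utilization_matched)  # 0.5 1.0
```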
3 Direct allocation of a CDFG for localization
of data transfer
In a PE array based on data transfer between adjacent PEs, the
time for data transfer between PEs that are not adjacent becomes
large, because the data must traverse many PEs and switch blocks
step by step. Therefore, an allocation that localizes the data
transfer is important.
3.1 CDFG representation of an algorithm
Given an algorithm, we can compile its description into a
control/data flow graph (CDFG) representation as shown in Fig.
9. The CDFG is a graph that represents data dependencies between
operations. Each node of the CDFG represents an operation and
each edge represents a data dependency between operations. For
example, Figure 9(b) shows the CDFG that represents the
algorithm shown in Fig. 9(a).
Figure 9. Descriptions of an algorithm. (a) C language
description:
    if(a<0) {
        ans = 2 * 3 + 1;
    } else {
        ans = 4 / (5 - 6);
    }
(b) CDFG description (labels: constants 1 to 6; condition a<0
with true and false branches joined at endif; data flow and
control flow edges).
Figure 10. Allocation to minimize the number of functional
units. (a) CDFG (nodes O1, O2, O3); (b) Datapath
interconnections (MUX); (c) Mapping into PEs (PE, SB, PE data
transfer).
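As an illustration, the algorithm of Fig. 9(a) can be written down as a small CDFG data structure and evaluated. The node names, the dictionary encoding and the "select" node that models the endif merge are assumptions of this sketch, not the representation used by the authors.

```python
import operator

OPS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "div": lambda x, y: int(x / y),   # truncating division, as in C
    "lt": operator.lt,
}

# node name -> (operation, operands); operands are constants, inputs or nodes
CDFG = {
    "cond": ("lt", ["a", 0]),
    "t_mul": ("mul", [2, 3]),
    "t_add": ("add", ["t_mul", 1]),
    "f_sub": ("sub", [5, 6]),
    "f_div": ("div", [4, "f_sub"]),
    "ans": ("select", ["cond", "t_add", "f_div"]),   # control flow merge
}


def evaluate(node, inputs):
    """Evaluate a node by following its data dependencies."""
    if not isinstance(node, str):
        return node                      # constant operand
    if node in inputs:
        return inputs[node]              # external input such as a
    op, args = CDFG[node]
    if op == "select":
        cond, if_true, if_false = args
        taken = if_true if evaluate(cond, inputs) else if_false
        return evaluate(taken, inputs)
    return OPS[op](*(evaluate(arg, inputs) for arg in args))


def dependency_edges():
    """Edges (producer, consumer) between operation nodes."""
    return [(src, dst) for dst, (_, args) in CDFG.items()
            for src in args if src in CDFG]


print(evaluate("ans", {"a": -1}), evaluate("ans", {"a": 3}))  # 7 -4
```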
3.2 Problem of a typical allocation
To execute an algorithm represented by a CDFG, the operations in
the CDFG must be mapped onto functional units. This task is
called allocation.
An allocation that minimizes the number of functional units is
one of the most typical allocations. Suppose that the CDFG shown
in Fig. 10(a) is given. In this allocation, some nodes in the
CDFG are mapped onto the same functional unit. Hence, the
utilization of the functional units becomes high and the active
area becomes small.
However, this allocation needs many multiplexers for
interconnecting the units (Figure 10(b)). Therefore, the number
of PEs used for interconnection becomes large and the time to
perform the data transfer increases.
3.3 Direct Allocation of a CDFG
To solve this problem, a direct allocation of a CDFG is
proposed. Suppose that the CDFG shown in Fig. 11(a) is given. In
the direct allocation, a single node in the CDFG is mapped onto
a single functional unit as shown in Fig. 11(b). Therefore, a PE
executes only one operation and the connections between PEs are
fixed.
The direct allocation has two main advantages over the
allocation that minimizes the number of functional units.
Figure 11. Direct allocation of operations in a CDFG. (a) CDFG
(nodes O1, O2, O3); (b) Datapath interconnections; (c) Mapping
into PEs (PE, SB, PE data transfer).
Figure 12. Direct allocation of operations in a control flow.
(a) CDFG (inputs a, b, c; condition a < 0 with true and false
branches joined at endif; data flow and control flow edges);
(b) Mapping into PEs (PEs for <, + and -, MUX, SB, PE data
transfer).
Firstly, the complexity of the interconnection is minimized as
shown in Fig. 11(b). This is because an input of one functional
unit is connected to only one output of another functional unit.
Therefore, the number of PEs used for interconnection is reduced
and the time to perform the data transfer decreases (Figure
11(c)).
Secondly, a control unit for control flow is not necessary in
the PE, because the control signals for control flow are
produced by other PEs. Thus, the PE is considerably small and a
highly parallel PE array can be constructed.
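A minimal sketch of the direct allocation for the three-node CDFG of Fig. 11(a) is given below. The grid coordinates, the function name and the adjacency check are assumptions made for illustration; the placement is chosen by hand and is not the output of any mapping tool.

```python
def check_direct_allocation(edges, placement):
    """Verify that each PE holds at most one CDFG node and return the
    grid (Manhattan) distance covered by every data edge."""
    cells = list(placement.values())
    if len(set(cells)) != len(cells):
        raise ValueError("direct allocation maps at most one node onto each PE")
    distances = {}
    for src, dst in edges:
        (r1, c1), (r2, c2) = placement[src], placement[dst]
        distances[(src, dst)] = abs(r1 - r2) + abs(c1 - c2)
    return distances


# CDFG of Fig. 11(a): O1 and O2 feed O3.
edges = [("O1", "O3"), ("O2", "O3")]
# Hypothetical placement (row, column): O3 sits between its two producers.
placement = {"O1": (0, 0), "O3": (0, 1), "O2": (0, 2)}

distances = check_direct_allocation(edges, placement)
all_adjacent = all(d == 1 for d in distances.values())
print(distances, all_adjacent)
```

When every edge has distance 1, each transfer passes through a single switch block between neighboring PEs, and no multiplexer is needed to share a functional unit.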
4 Evaluation
4.1 Chip layouts and features
Figure 13 shows a chip layout of the FPVLSI designed
in a 0.35µm CMOS process. Features of the FPVLSI are
summarized in Table 1. The clock period is the sum of the
delays of a switch block and a PE from the HSPICE circuit
simulation.
Let us compare the FPVLSI with the FPGA shown in
Fig. 1. A detailed structure of the FPGA is based on [4].
Features of the FPGA are summarized in Table 2. Because
of the wiring complexity, a delay of a switch block of the
FPGA becomes larger than that of the FPVLSI.
4.2 Evaluation using FFT
Assume that the chip area is 10 mm × 10 mm. Let us evaluate the
performance of the FPVLSI when a 16-point FFT is performed for
1000 data sets with 16-bit fixed-point calculation.
Figure 13. Chip layout of the FPVLSI. (Labels: 10 mm × 10 mm
chip; array of PEs and switch blocks (SB); cell dimensions 180
µm and 125 µm; ALU/switch block and shift register.)
Table 1. Features of the FPVLSI.
  Technology                0.35-µm double-poly triple-metal CMOS
  Chip size                 10 × 10 mm²
  Number of PEs             4400
  Delay of a PE             1.3 ns
  Delay of a switch block   0.7 ns
  Clock period              1.9 ns
Let us begin with the evaluation of butterfly processing
elements (BPEs), since the FFT is performed by iterating the FFT
butterfly computation. Figure 14(a) shows a CDFG of the
butterfly computation. Figures 14(b) and 14(c) show its
allocations into the FPVLSI and the FPGA, respectively. Table 3
summarizes a performance comparison between an FPVLSI-based BPE
and an FPGA-based BPE. Note that the normalized throughput of
the FPVLSI-based BPE is 1.18 times higher than that of the
FPGA-based BPE. This is because the bit-serial architecture
reduces the wiring delay of the FPVLSI-based BPE to 72% of that
of the FPGA-based one.
Table 2. Features of the FPGA.
  Technology                0.35-µm double-poly triple-metal CMOS
  Chip size                 10 × 10 mm²
  Logic block structure     4-input LUT × 2
  Number of logic blocks    2116
  Number of wire tracks     16
  Delay of a logic block    0.93 ns
  Delay of a switch block   1.66 ns
Figure 14. Allocations of a CDFG of FFT butterfly computation.
(a) CDFG of butterfly computation; (b) Allocation into the
FPVLSI (MUX, memory, unused PEs); (c) Allocation into the FPGA
(16 × 16 logic blocks, unused logic blocks).
Table 3. Performance comparison using BPEs.
                            FPVLSI     FPGA
  Area                      3.6 mm²    70.0 mm²
  Normalized wiring delay   1.00       1.38
  Normalized throughput     1.18       1.00
Next, let us evaluate the performance of the FFT implementation.
To implement the 16-point FFT, four butterfly processing
elements are connected in series as shown in Fig. 15. Table 4
shows a comparison between an FPVLSI-based FFT implementation
and an FPGA-based one. The number of BPEs in the FPVLSI-based
FFT implementation is 24 times larger than that in the
FPGA-based one. Since the FFT can be performed in parallel for
all the data sets, the performance increases in proportion to
the number of BPEs.
Finally, the total performance of the FPVLSI-based
implementation for 1000 data sets is 28 (≈ 1.18 × 24) times
higher than that of the FPGA-based one.
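The arithmetic behind these figures can be reproduced from the values in Tables 3 and 4. The variable names are chosen for this sketch only.

```python
# Values taken from Table 3 and Table 4.
throughput_fpvlsi, throughput_fpga = 1.18, 1.00     # normalized, per BPE
bpes_fpvlsi, bpes_fpga = 24, 1                      # BPEs on a chip
wiring_delay_fpvlsi, wiring_delay_fpga = 1.00, 1.38 # normalized

# Performance scales with per-BPE throughput times the number of BPEs.
speedup = (throughput_fpvlsi * bpes_fpvlsi) / (throughput_fpga * bpes_fpga)
wiring_delay_ratio = wiring_delay_fpvlsi / wiring_delay_fpga

print(round(speedup, 2))             # 28.32, reported as 28
print(round(wiring_delay_ratio, 2))  # 0.72, i.e. 72%
```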
Figure 15. Block diagram of an FFT. (Labels: four butterfly
processing elements (BPE) in series, with MUXes and shift
registers D4, D4, D2, D2, D1, D1; Dn: n-bit shift register.)
Table 4. Performance comparison between the FPVLSI and the FPGA.
                              FPVLSI   FPGA
  Number of BPEs on a chip    24       1
  Normalized throughput       1.18     1
  Normalized performance      28       1
5 Conclusion
A high-performance FPVLSI is proposed to solve the data transfer
bottleneck between PEs. The key technologies for reducing the
area and delay of the interconnection network are a bit-serial
architecture and a direct allocation of a CDFG. Since the FPVLSI
does not require a global control unit, the delay of control
signals is also reduced.
The higher clock rate of the bit-serial architecture will become
a problem when a PE communicates with the outside world. Thus,
an efficient I/O interface for the bit-serial architecture is
left as future work.
References
[1] M. Kameyama, Y. Fujioka, VLSI Processor System for Robotics,
Journal of Robotics and Mechatronics, vol.8, no.6, 1996.
[2] M. Hariyama, M. Kameyama, Path Planning Based on Distance
Transformation and Its VLSI Implementation, Journal of Robotics
and Mechatronics, vol.12, no.5, 2000.
[3] W. Carter, K. Duong, R. H. Freeman, H. Hsieh, J. Y. Ja,
J. E. Mahoney, L. T. Ngo, S. L. Sze, A user programmable
reconfigurable gate array, Proc. Custom Integrated Circuits
Conf., pp.232-235, May 1986.
[4] P. Chow, S. O. Seo, J. Rose, K. Chung, G. Paez-Monzon,
I. Rahardja, The design of an SRAM-based field-programmable gate
array, Part I: Architecture, IEEE Trans. VLSI Syst., vol.7,
no.2, pp.191-197, 1999.
[5] N. Weste, K. Eshraghian, Principles of CMOS VLSI Design,
Addison-Wesley, pp.340-343, 1985.
