A modified reconfigurable data path processor by Maki, G. et al.
3rd NASA Symposium on VLSI Design 1991
A Modified Reconfigurable Data Path Processor
G. Ganesh, S. Whitaker and G. Maki 1
NASA Space Engineering Research Center for VLSI System Design
University of Idaho, Moscow, Idaho 83843
Phone: 208-885-6500 Fax: 208-885-7579
Abstract- High throughput is an overriding factor dictating system performance.
In this paper, a configurable data path processor is presented which
can be modified to optimize performance for a wide class of problems. The
new processor is specifically designed for arbitrary data path operations and
can be dynamically reconfigured.
1 Introduction
High performance computers are increasingly in demand in areas such as weather forecasting
and structural analysis. These often require architectures which are different from the
standard von Neumann machine, also called the standard stored program computer.
Stored program computers are designed to be general purpose and are not optimized for
any specific problem. Fully customized architectures can be optimized to achieve maximum
performance for a specific problem, but such processors cannot usually be adapted to
produce solutions to different problems.
Modern technology opens new dimensions to the designer of high performance systems,
by providing low cost VLSI modules which have high computational throughput. For a
given functionality, there are two major dimensions of performance: delay and throughput.
High throughput is the most critical factor in real time processing of massive amounts
of data, for example in digital signal processing and database operations. Since general
purpose parallel computers cannot offer real time processing speeds, special purpose
computers become the only appealing alternatives.
Special purpose processors can be of two types: 1) Dedicated Processors and 2) Reconfigurable/Programmable
Processors. While the former are characterized by high processing
speeds, inflexibility, long design time and high design cost, the latter offer
greater flexibility in coping with changes in the object problem and system specification, and
greater design economy, at the cost of some reduction in throughput.
This paper presents a general purpose accelerator, an enhancement of the processor in [1],
that allows a variety of data path configurations, each characterized by its own topology
of activated interconnections, and is hence applicable to a wide range of applications.
This configurable architecture combines the general purpose advantages of the stored
program machine with the optimization of a fully customized architecture to achieve max-
imum performance for a broad class of problems. Every functional unit, data path and
1 This research was supported (or partially supported) by NASA under Space Engineering Research
Center Grant NAGW-1406.
https://ntrs.nasa.gov/search.jsp?R=19940013887 2020-06-16T18:08:25+00:00Z
Figure 1: Block Diagram (blocks: Input Port, Programming Path, Data Path, Output Port, State Controller, Handshaking)
control structure can be individually optimized for a given algorithm. The architecture
presented is capable of operating in parallel, pipelined or sequential modes. The user
configures the data path through programming. The architecture can be altered during
operation by reprogramming, can be initialized and fixed for dedicated processing, or can
be attached to a host processor.
The reconfigurable processor differs from the stored program computer in the sense
that there is no instruction fetch-decode-execute cycle. Moreover, an operation can be
executed every clock pulse in every data path element.
2 Processor Design
The data path and the control structure have been designed to allow sequential, pipelined
or parallel operation. The processor is configured as a set of identical data path elements
with an overall controller. The top level block diagram of this processor is shown in Figure
1. There are two major components: the data path, which is an ALU-register stack to
manipulate the data, and the state controller, which controls the register stack. The actual
hardware configuration of the data path is specified during the programming of the state
controller.
2.1 Data Path
Each data path element is as shown in Figure 2. Let there be m data path elements, each
n bits wide. Direct communication between each data path element is an essential feature
to achieve pipeline or parallel operation. Therefore, to allow all possible register to register
communications, the data path bus must be m × n bits wide. This complete connectivity
results in the flexible reconfigurability, but also limits the number of data path elements.
Each data path element consists of a Multiplier Accumulator (MAC) which multiplies
two eight bit numbers and also adds two sixteen bit numbers to the product (a·b + c + d).
This output is stored in a globally accessible register of the data path element. Also
contained within each data path element are two dedicated registers, which are used for
operations local to that data path element. The addition of these dedicated registers is
one of the improvements over [1]. This avoids the use of an entire data path element for
the purpose of storage only. Since the area of a data path element is constrained by the
m × n interconnect bus, the addition of these registers should have little impact on the
overall chip area.
Figure 2: Data Path Element (blocks: multiplexor units, selector units Sel1 and Sel2, logic units, dedicated register units, and the MAC with inputs a, b, c, d producing a·b + c + d for the global register)
The data path also contains a set of ALU and selector units. The ALU can implement
an arbitrary arithmetic/logic operation. The operations of the first logic unit are shown
in Table 1. The selector unit selects the output of its respective multiplexors or the output
of the respective dedicated register, as shown in Table 2. The m to 1 multiplexors can
select the output of any of the m globally accessible registers. The MAC operates on the
output of the selector unit and the logic unit to allow a mixture of arithmetic and logic
functions. Table 3 shows examples of ALU operations that can be performed. CI is the
carry in data bit.
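The way the logic control code and the carry in combine can be illustrated with a minimal Python sketch. It rests on two assumptions that are not stated in the text: that the 4-bit logic control code acts as a truth table for (A, B) = (0,0), (0,1), (1,0), (1,1), most significant bit first (which agrees with codes such as 0011 = A and 0101 = B in Table 1), and that the arithmetic result is the selector output plus the logic unit output plus CI, modulo 2^16. The names `logic_unit`, `alu`, `WIDTH` and `MASK` are hypothetical.

```python
WIDTH = 16                      # data path width assumed for this sketch
MASK = (1 << WIDTH) - 1

def logic_unit(code, a, b):
    """Apply a 4-bit logic control code bitwise to a and b.

    Assumption: the code is a truth table whose bits, most significant
    first, give the output for (A, B) = (0,0), (0,1), (1,0), (1,1).
    """
    result = 0
    for bit in range(WIDTH):
        a_bit = (a >> bit) & 1
        b_bit = (b >> bit) & 1
        index = (a_bit << 1) | b_bit          # 0..3, selects one code bit
        result |= ((code >> (3 - index)) & 1) << bit
    return result

def alu(code, sel, a, b, ci):
    """Assumed composition: selector output + logic unit output + carry in."""
    return (sel + logic_unit(code, a, b) + ci) & MASK

a, b = 100, 42
print(alu(0b0101, a, a, b, 0))   # logic unit passes B: A plus B
print(alu(0b1010, a, a, b, 1))   # logic unit passes B', CI = 1: A - B
```

Under these assumptions, subtraction needs no dedicated hardware: complementing B in the logic unit and setting CI gives the 2's complement difference.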
Each globally accessible register is controlled as defined in Table 4. The dedicated
registers are controlled as shown in Table 5.
The control word for each data path element structure is shown in Figure 3. For 16
data path elements, the control word is 33 bits wide.
Logic Control   Logic Operation
0000            0
0001            A AND B
0010            A AND B'
0011            A
0100            A' AND B
0101            B
0110            A XOR B
0111            A OR B
1000            A NOR B
1001            A XNOR B
1010            B'
1011            A' NAND B
1100            A'
1101            A NAND B'
1110            A NAND B
1111            1
Table 1: Logic Unit 1 Control
MC   Selector Output
00   A Mux
01   B Mux
10   Dr1
11   0
Table 2: Selector Unit 1 Control
Logic Unit   Sel Unit   CI   Output
0000         A          1    A + 1
1111         A          0    A - 1
0011         -          1    A + 1
0101         A          0    A plus B
1010         A          0    1's complement A - B
1100         -          1    2's complement A
1010         A          1    2's complement A - B
Table 3: Example ALU operations
Figure 3: Data Path Element Control Word
RC1  RC2  Register Function
0    0    Hold Present Data
0    1    Load MAC Output
1    0    Shift MAC Right and Load
1    1    Shift MAC Left and Load
Table 4: Global Register Load Control
Dr1 Register Function
0 Hold Present Data
1 Load Mux Output
Table 5: Dedicated Register Load Control
2.2 Control
The state controller specifies the control words for each data path element. The hardware
compiled control words are contained in a control store memory as depicted in Figure
4. The output of each word from the control store drives each data path element. A
total of 536 bits are needed in each control store word to control the data path elements
in a 16 element, 16-bit data path structure. Program control within the control store is
implemented with a program location counter. The control store can be of an arbitrary
depth; here, it is depicted as 256 words deep. To perform a jump within the control store,
an 8-bit jump address is provided in each control store word as depicted in Table 6.
The control store must be specified prior to operation. This specification (hardware
compilation) can be achieved through the input port, 16 bits at a time. After the control
store is specified, the processor is ready to operate in real time.
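The control store figures quoted above can be checked with a short calculation. The following Python sketch is only an illustration: the function names are hypothetical, and the transfer count assumes that the control words are packed back to back into 16-bit input transfers with no per-word padding, which the text does not specify.

```python
def control_store_word_bits(elements=16, element_word_bits=33, jump_bits=8):
    """Width of one control store word: one control word per element plus the jump address."""
    return elements * element_word_bits + jump_bits

def load_transfers(depth=256, word_bits=536, port_bits=16):
    """Input port transfers needed to fill the control store (assumes tight packing)."""
    total_bits = depth * word_bits
    return -(-total_bits // port_bits)   # ceiling division

print(control_store_word_bits())   # 16 * 33 + 8
print(load_transfers())            # 256 * 536 / 16
print(2 ** 8)                      # words addressable by an 8-bit jump address
```

The 8-bit jump address matches the depicted depth of 256 words; a deeper control store would need a wider address field.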
3 Operation
The control store word defines the operation and the source of data (registers) for each
data path element. The output of any pair of registers Ri and Rj, i, j = 0, 1, 2, ..., 15, can be
input to a data path element. In general, the operation can be specified as
$$R_i \; [\text{ALU operation}] \; R_j \rightarrow R_k \qquad (1)$$
Field                              Width
Data Path Control Word, Cell 0     33 bits
Data Path Control Word, Cell 1     33 bits
...
Data Path Control Word, Cell i     33 bits
...
Data Path Control Word, Cell 15    33 bits
Program Counter Address            8 bits
Table 6: Control Store Word
Figure 4: Control Store and Data Path (labels: Input, CELL 0 through CELL 15, Output, 16-bit data path, Overflow, Control Store of 256 process control words of 536 bits each, Programming Register, 8-bit Program Counter)
which means that the result of an ALU operation upon the contents of any register pair
Ri and Rj can be placed into register Rk. This is true for any and all registers in the data
path and all operations occur simultaneously. Since each data path element can function as
an independent element, the entire data path can be configured to operate in the sequential,
pipelined or parallel modes. The controller also specifies the next state of the controller
and provides handshaking for external input and output control functions. The memory
can be ROM for dedicated processing or RAM or EPROM where field programmability
is desired. Depicted in Figure 4 is a feature where the control store can be programmed
via the input data port. The entire control store can be initialized in a 16 bit word serial
manner.
The control store is specified prior to operation. Once the control store is specified, the
processor executes at the rate specified by the system clock. With static cells, the system
clock can range from d.c. to the maximum allowable by the IC process.
3.1 Examples
Consider the following Digital Filter examples to illustrate the use of this processor. The
general second order difference equation is
$$y(n) = a_0 x(n) + a_1 x(n-1) + a_2 x(n-2) - \beta_1 y(n-1) - \beta_2 y(n-2). \qquad (2)$$
This implements an IIR filter. For an FIR filter the equation simplifies to
$$y(n) = a_0 x(n) + a_1 x(n-1) + a_2 x(n-2). \qquad (3)$$
To implement the FIR filter in the architecture presented in this paper, let Dr61 contain
a0, Dr51 contain a1 and Dr41 contain a2, as shown in the simplified block diagram of Figure
5. Also let R4, R5, R6 and R0 be initially reset. The operations can be described in a
register transfer language where each Pi is a control state that defines the data transfers
that take place when Pi is active.
P0: Data → R0
P1: R0 · Dr61 + R5 → R6, R0 · Dr51 + R4 → R5,
    R0 · Dr41 → R4
Assuming that constants are preloaded into the registers and that 2's complement
arithmetic is used, the control word for each data path element (Ri) is shown in Table 7.
Each control state Pi represents one parallel control word; the portion of the control word
for each Ri is shown on a separate line for the sake of simplicity. For all
registers not shown in Table 7, the register control bits in their control words are
00, indicating no operation, for that control state Pi.
There are a total of 5 operations that occur in 2 clock pulses. If this processor operated
at 20 MHz, 50 million operations per second would be performed.
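The register transfers for the FIR filter can be traced with a small Python sketch. It is a behavioural illustration only: it assumes that all P1 transfers read the register values held before the clock edge and commit together, it uses unbounded Python integers rather than the 8-bit and 16-bit operand widths of the MAC, and the function names are hypothetical.

```python
def fir_rtl(samples, a0, a1, a2):
    """Step the P0/P1 control states once per input sample; returns R6 after each P1."""
    r0 = r4 = r5 = r6 = 0                 # R0, R4, R5, R6 initially reset
    dr61, dr51, dr41 = a0, a1, a2         # preloaded dedicated registers
    outputs = []
    for x in samples:
        r0 = x                            # P0: Data -> R0
        # P1: every right-hand side uses the old register contents
        new_r6 = r0 * dr61 + r5
        new_r5 = r0 * dr51 + r4
        new_r4 = r0 * dr41
        r6, r5, r4 = new_r6, new_r5, new_r4
        outputs.append(r6)
    return outputs

def fir_direct(samples, a0, a1, a2):
    """Direct evaluation of y(n) = a0 x(n) + a1 x(n-1) + a2 x(n-2)."""
    padded = [0, 0] + list(samples)
    return [a0 * padded[n + 2] + a1 * padded[n + 1] + a2 * padded[n]
            for n in range(len(samples))]

def operations_per_second(operations, clocks, clock_hz):
    """Throughput when `operations` complete every `clocks` clock pulses."""
    return operations * clock_hz // clocks

print(fir_rtl([1, 2, 3, 4], 2, 3, 5) == fir_direct([1, 2, 3, 4], 2, 3, 5))
print(operations_per_second(5, 2, 20_000_000))
```

In this sketch R6 holds y(n) at the end of each P1 state, because R5 and R4 carry the partial sums of the two older samples forward from the previous pass.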
Figure 5: FIR Filter Block Diagram
State  Reg  MuxA  MuxB  MuxC  MuxD
P0     R0   A
P1     R4   R0
       R5   R0          R4
       R6   R0          R5

State  Reg  ALU1  ALU2  SC1  SC2  CI  RC
P0     R0   1111  0000  00   11   0   01
P1     R4   0011  0000  10   11   0   01
       R5   0011  0000  10   00   0   01
       R6   0011  0000  10   00   0   01
Table 7: FIR Filter Control Word Programming
State  Reg  MuxA  MuxB  MuxC  MuxD
P0     R0   A
P1     R2   R0          R6
       R6   R0          R7
       R7   R0
       R10  R0
       R9   R9
       R8   R9

State  Reg  ALU1  ALU2  SC1  SC2  CI  RC
P0     R0   1111  0000  00   11   0   01
P1     R2   0011  0000  10   00   0   01
       R6   0011  0000  10   00   0   01
       R7   0011  0000  10   11   0   01
       R10  1111  0000  00   11   0   01
       R9   0011  0101  10   00   0   01
       R8   0011  0000  10   11   0   01
Table 8: IIR Filter Control Word Programming
Dr21   a0
Dr61   a1
Dr71   a2
Dr91   β1
Dr81   β2
R10    y(n)
R9     y(n-1)
R8     y(n-2)
For an IIR filter, consider the register assignment listed above. A register transfer language
description of the operations to implement the IIR filter equation would be
P0: Data → R0
P1: R0 · Dr21 + R6 → R2, R0 · Dr61 + R7 → R6,
    R0 · Dr71 → R7, Dr81 · R9 → R8,
    R2 + R8 + Dr91 · R9 → R9, R9 → R10.
Assume again that constants are preloaded into the registers as 2's complement numbers
and that 2's complement arithmetic is used. The control word for each data path
element is shown in Table 8. There are a total of 9 operations that occur in 2 clock pulses;
operating at 20 MHz, 90 million operations per second would be performed.
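The IIR register transfers can be traced in the same behavioural style. The Python sketch below is an illustration under stated assumptions, not a model of the actual hardware: all P1 transfers are taken to read the old register contents and commit together, word widths are ignored, and the feedback constants `c1` and `c2` (hypothetical names) stand for whatever signed values are preloaded into Dr91 and Dr81, so the recursion is written with plus signs.

```python
def iir_rtl(samples, a0, a1, a2, c1, c2):
    """Step the P0/P1 control states once per sample; returns R9 after each P1."""
    r = dict.fromkeys(["R0", "R2", "R6", "R7", "R8", "R9", "R10"], 0)
    trace = []
    for x in samples:
        r["R0"] = x                                    # P0: Data -> R0
        nxt = {                                        # P1: all reads use old values
            "R2": r["R0"] * a0 + r["R6"],
            "R6": r["R0"] * a1 + r["R7"],
            "R7": r["R0"] * a2,
            "R8": c2 * r["R9"],
            "R9": r["R2"] + r["R8"] + c1 * r["R9"],
            "R10": r["R9"],
        }
        r.update(nxt)
        trace.append(r["R9"])
    return trace

def iir_reference(samples, a0, a1, a2, c1, c2):
    """y(n) = a0 x(n) + a1 x(n-1) + a2 x(n-2) + c1 y(n-1) + c2 y(n-2), zero initial state."""
    y = []
    for n, x in enumerate(samples):
        x1 = samples[n - 1] if n >= 1 else 0
        x2 = samples[n - 2] if n >= 2 else 0
        y1 = y[n - 1] if n >= 1 else 0
        y2 = y[n - 2] if n >= 2 else 0
        y.append(a0 * x + a1 * x1 + a2 * x2 + c1 * y1 + c2 * y2)
    return y

xs = [1, 2, 3, 4, 5]
trace = iir_rtl(xs, 2, 3, 5, 1, -1)
ref = iir_reference(xs, 2, 3, 5, 1, -1)
print(trace[1:] == ref[:-1])   # R9 follows the reference with one sample of latency
```

Under these assumptions R9 reproduces the recursion one sample late, because the feed-forward sum passes through R2 before it is added to the feedback terms.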
4 Summary
A new architecture has been presented which allows for sequential, pipelined, or parallel
operation. The control-data path structure consists of m identical data path elements. The
data path elements can be independently specified to allow parallel or pipelined operation.
The control of the data path is specified by the control store memory. The processor can be
a dedicated stand-alone machine or attached to a general purpose processor. As an attached
processor, it can be dynamically modified to assume different data path configurations if the
control store is RAM based. It is proposed that this architecture is a first step in producing
a processor that gives the digital designer the same kind of flexibility in altering data path
configurations that field programmable gate arrays give the logic designer.
Acknowledgement: This research was supported in part by NASA under grant NAGW-1406
and grant NAG5-1043.
References
[1] G. Maki, S. Whitaker and G. Ganesh, "A Reconfigurable Data Path Processor",
Proceedings of the ASIC Conference, Rochester, NY, Sept. 1991.
