A reconfigurable interface for systolic arrays / by Seaman, Anthony William
Lehigh University
Lehigh Preserve
Theses and Dissertations
1987
Part of the Electrical and Computer Engineering Commons
Recommended Citation
Seaman, Anthony William, "A reconfigurable interface for systolic arrays /" (1987). Theses and Dissertations. 4823.
https://preserve.lehigh.edu/etd/4823
A Reconfigurable Interface
For Systolic Arrays
by
Anthony William Seaman
A Thesis
Presented to the Graduate Committee
of Lehigh University
in Candidacy for the Degree of
Master of Science
in
Electrical Engineering
Lehigh University
1987
This thesis is accepted and approved in partial fulfillment of the requirements
for the degree of Master of Science in Electrical Engineering.
(date)
Professor in Charge
EE Division Chairperson
CSEE Department Chairperson
Acknowledgements
I wish to express my gratitude to Dr. Meghanad D. Wagh for the invaluable
assistance and insight he provided with this thesis. His guidance and technical
contributions were essential to completing this work. Thanks also to the faculty,
staff, and other graduate students of the CSEE department for their help during my
research for this thesis.
Table of Contents 
1. Introduction 
1.1 Systolic Arrays 
1.1.1 An Overview 
1.1.2 Current Technology 
1.2 Function of Interface 
1.3 Objectives of Thesis 
1.4 Organization of Thesis 
2. General Interface Design 
2.1 Introduction 
2.2 Basic Design Considerations 
2.3 Input Stager and RAM 
2.4 Switch Network 
2.5 Generalized Parallel-to-Serial Converter 
2.6 Accessing Results 
2.6.1 Overview 
2.6.2 Permutation Network 
2.6.3 Output Stager 
2.6.4 Task Scheduling 
2.7 Memory Utilization
2.7.1 Memory Control
2.7.2 Optimization of Memory Size
3. A Specific Interface Design
3.1 Introduction
3.2 Description of Problem
3.3 Problem Solution
3.4 Design Principles
3.5 Collecting Results
3.6 Control
3.6.1 Microcontroller Hardware
3.6.2 Microsequencer Software
3.7 Comparisons to General Interface
4. Conclusion
4.1 Summary of Important Results
4.2 Future Work 
List of Figures
Figure 2-1: Interface Block Diagram
Figure 2-2: Decoder State Machine
Figure 2-3: Row-to-Column Permuting
Figure 2-4: Input Stager Circuit
Figure 2-5: Output Formats
Figure 2-6: Simple-Shift/Complex-Load
Figure 2-7: Simple-Load/Complex-Shift
Figure 2-8: Accessing Results
Figure 2-9: Permutation Network, N=4
Figure 2-10: Butterfly Network
Figure 2-11: Output Stager Module
Figure 2-12: Output Stager
Figure 2-13: Output Stager, N=8
Figure 2-14: Data Collection Circuit: N =l==B
Figure 2-15: Data Collection/Propagation
Figure 2-16: Interface Memory Block Diagram
Figure 2-17: Memory Architecture
Figure 2-18: Memory/Register Activity Table
Figure 2-19: Complexity vs. Memory Size
Figure 3-1: Laser Scan
Figure 3-2: Good Scan
Figure 3-3: Inside-out Scan
Figure 3-4: Data Configuration
Figure 3-5: Organization of a string in FIFO
Figure 3-6: FIFO word organization - Flip
Figure 3-7: Flip MUX
Figure 3-8: Shift Register
Figure 3-9: A-String Load
Figure 3-10: SR Configuration
Figure 3-11: SR After 6 Clocks
Figure 3-12: SR Initial, Flip True
Figure 3-13: SR: Clocks 2 thru 4
Figure 3-14: Microsequencer
Figure 3-15: Instruction Set
Figure 3-16: µ-Sequencer Block Diagram
Figure 3-17: µController Block Diagram
Figure 3-18: X-string load routine
Figure 3-19: X-string load Flowchart
Figure 3-20: A-string load routine
Figure 3-21: A-string load Flowchart
Figure 3-22: Undo Flowchart
Abstract
Demands for faster and more powerful signal processing capabilities have
resulted in the proliferation of reconfigurable bit-sequential systolic arrays. Data
communication with these machines must be at rates commensurate with their
operating speed in order to fully realize their computational potential.
The front end for the systolic array is typically a general purpose computer
which must be interfaced with the array. This thesis is concerned with studying
such an interface, identifying design problems, and proposing optimal solutions.
Architectural differences lead to a great disparity in speed and data formats
between the general purpose host computer and the reconfigurable array. The
difference in operating speeds and asynchronicity is overcome with the use of FIFOs
and careful selection of local memory word sizes. Input data conditioning is
performed in a series of subsystems, each designed with VLSI implementation as a
consideration. Results generated by the simultaneous execution of multiple tasks in
the array are routed through two subsystems. The first subsystem separates data
by tasks and the second reformats data to match the host computer's requirements
and pipelines these words for optimal throughput.
Chapter 1
Introduction
1.1 Systolic Arrays
1.1.1 An Overview
Exploding technological applications have resulted in demands for ever
increasing processing power in digital hardware. Applications such as image
processing require data processing at rates exceeding 10 Mbytes/sec. To meet
these needs, many new architectural innovations have emerged in recent years.
Among these, one of the most promising is the systolic array [1-4]. This
architecture benefits from multiprocessing without sacrificing synchronization and
the simplicity of data paths. Systolic arrays are modular in nature, as the
name implies, and each module communicates only with its nearest neighbor.
This characteristic translates to a very efficient implementation in silicon.
In a normal processor, an execution cycle is always preceded by an
equally time-intensive instruction fetch cycle. The high throughput of a systolic
array may be attributed to the fact that each processing element in it executes
only a single instruction and, therefore, no potential processing time is spent
performing an instruction fetch. However, this results in a very inflexible
architecture which requires a separate array for each task to be executed. Such
limitations have prompted researchers to investigate reconfigurable systolic
arrays. Processing elements in such arrays are reconfigured, if needed, before
the execution of a task.
A further improvement in systolic arrays comes from the use of bit-sequential
processing elements. Such elements are smaller and easier to design.
Additionally, inter-processor communication is greatly simplified, since all data
buses are only one bit wide. Both these factors contribute to more efficient use
of VLSI area, making this a very attractive option in systolic array designs.
1.1.2 Current Technology 
The Massively Parallel Processor (MPP), designed in 1983 by Goodyear
Aerospace Corp. under contract from NASA Goddard Space Flight Center, is
the first example of a bit-sequential systolic array [5,6]. The MPP is composed
of 16,384 processing elements arranged in a square array. This SIMD (single
instruction multiple data) machine performs image processing on data collected
from satellites. The first commercially available reconfigurable systolic array,
the Geometric Arithmetic Parallel Processor (GAPP), with bit-sequential
modules, was marketed by NCR Corp. in 1984 [2,7]. This MIMD (multiple
instruction multiple data) array is composed of 72 processing elements arranged
in a 9x8 grid on a single VLSI chip. Because of the MIMD nature, the
applications of this array are wide and varied. There is an ongoing effort at
Lehigh University, and at other schools, to investigate various aspects of
reconfigurable VLSI bit-sequential systolic arrays.
1.2 Function of Interface
The systolic arrays previously described typically have a general purpose 
host computer as a user interface. The data and instructions resident in the 
host computer must be communicated efficiently to the systolic array to take 
full advantage of its processing power. This thesis is concerned with the 
development of techniques necessary to accomplish this efficient transfer of 
information. 
Due to the increasing popularity of bit-sequential systolic arrays, the need
for such interfaces is now greater than ever. Design of these interfaces is
complicated because of the differences in speed and data formats of the host
and the array. It is not unusual for bit-sequential systolic arrays to have
clocking rates in excess of 30-40 MHz, whereas baud rates of a typical host are
limited to 19.2 kHz. Similarly, the data format of the host computer is a fixed
size word which does not necessarily match the bit-slice format requirement of a
systolic array. Further, this bit-slice format is dependent upon the particular
application, in the case of a reconfigurable array, giving rise to additional
complexity in the design of the interface.
NASA's MPP system uses a "staging memory" to effect data reformatting
[5,6]. This staging memory consists of three blocks: the main stager; the input
sub-stager; and the output sub-stager. The main stager communicates with the
two sub-stagers and these, in turn, interface with the front-end computer and
the array. The staging memory is programmed to manipulate data flowing
through it by the staging memory manager, which receives from the front-end
computer a description of the data and task to be performed.
Most of the memory is in the main stager, which is made up of 32 RAM
banks (in a fully populated memory) arranged in 64-bit words. During one
cycle of 1.6 µsec, each memory bank can receive and transmit one word for a
transfer rate of 5 Mbytes/sec per bank (160 Mbytes/sec overall) for both the
input and output.
The sub-stagers' memory is smaller and composed of 128 1-bit RAMs
(since a bit slice for the MPP array is 128 bits), allowing data to be accessed
and rearranged bit by bit. Cycle time for the sub-stagers is 100 nsec, during
which 128 bits can be read and written. These bits are packed into 4-bit
nibbles to reduce the wire count between the main stager and the sub-stagers.
Data is transferred in 4-bit nibbles, 32 nibbles every 100 nsec. Incorporated in
each sub-stager is a permutation network to rearrange the data before being
stored in the memory.
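The transfer rates quoted above follow directly from the word sizes and cycle times. The short check below is only a sketch of that arithmetic; the variable names are ours and do not come from the MPP documentation.

```python
# Figures taken from the text: 64-bit words, 1.6 usec cycle, 32 banks.
WORD_BITS = 64
CYCLE_NS = 1600
BANKS = 32

bytes_per_word = WORD_BITS // 8
# bytes per nanosecond times 1000 gives bytes per microsecond, i.e. Mbytes/sec
bank_rate_mbytes = bytes_per_word * 1000 / CYCLE_NS
total_rate_mbytes = bank_rate_mbytes * BANKS

# Sub-stager: a 128-bit slice packed into 4-bit nibbles every 100 nsec.
SLICE_BITS = 128
NIBBLE_BITS = 4
nibbles_per_cycle = SLICE_BITS // NIBBLE_BITS

print(bank_rate_mbytes, total_rate_mbytes, nibbles_per_cycle)
```
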
1.3 Objectives of Thesis 
This thesis is devoted to investigating the architectural features of a host-
to-array interface. Design problems have been identified and solutions to them
examined, with results of these studies presented in Chapter 2. Some of the
more critical design considerations were found to be collecting data from the
array; matching memory and array bandwidths; and reformatting data between
host and the array. We will elaborate briefly on these points in the following
paragraphs.
The array generates data in formats based upon its current configuration.
It is the function of the data collection network to convert these different
formats to the data format of the host. Our solution to this problem allows
the array to execute more than one task simultaneously while ensuring that any
results generated are collected without any timing conflicts (Section 2.6.4).
Since an array works at clock rates many times greater than the presently
available memory elements, the array data storage and retrieval from this slow
memory presents a bottleneck. In addition, the memory must service the needs
of the host in transferring data to and from the array. The solution to this
problem is to time multiplex RAM chips so that the effective memory bandwidth
satisfies the demands of both the array and host simultaneously. Details of this
are presented in section 2.7.
Conversion of input data from the word parallel format of the host
computer to the serial nature of the array must be performed by a circuit that
is configurable to different formats as required by the array. The solution
presented in section 2.3 suggests a two stage data manipulation: the first stage
performs a row-to-column permutation (if necessary) and the second stage
implements a generalized parallel-to-serial conversion.
1.4 Organization of Thesis 
Chapter 2 of this thesis summarizes techniques which would be employed
in the design of a general interface to be used between a host computer and a
bit-sequential systolic array. Several key subsystems of this interface are
examined in great detail and, wherever possible, designs which are optimum in
terms of hardware minimization are presented. Pipelining is used extensively in
order to achieve speeds compatible with the array.
A specific implementation of the interface previously described is presented
in Chapter 3. It is used in an ongoing research project, in which the array's
flexibility demands are not as great as a general bit-sequential systolic array.
This has enabled us to present a simpler but complete design, and to discuss
specifically the control of the interface through the use of a bit-slice
microcontroller. Strict adherence to optimum subsystems detailed in Chapter 2
is relaxed in the design of some circuits of this interface due to the required use
of commercially available parts.
Finally, Chapter 4 summarizes the results obtained in this work and points
out areas that need further investigation.
Chapter 2 
General Interface Design 
2.1 Introduction 
The principal goal of this thesis is to investigate efficient interfacing
between a general purpose word-parallel host computer and a high speed
application-specific bit-sequential systolic array. This chapter presents the design
concepts of a universal interface that is adaptable to any size host and array.
The design will be such that the circuit is flexible enough to work directly with
various hosts and arrays.
2.2 Basic Design Considerations
The interface circuit between the host computer and the array of PEs
consists of several smaller circuits, or blocks, interconnected to perform the
required data staging functions. Essentially, what the interface needs to
overcome is the mismatch in data rates of the array and the host and the fact
that the host computer operates on data in a word-by-word manner and the
array in a bit (or bit slice) sequential format.
A block diagram of the interface design is shown in figure 2-1. The Input
and Output FIFOs and Conditioner & Memory blocks are used to bridge the
gap in data rates between the host and the array. The Input and Output
Stagers, together with the Parallel-to-Serial Converter and Switch Network, are
responsible for manipulating data into the proper format of the unit receiving
the data: word-by-word for the host or bit-sequential for the PE array.
on the task to be performed. A particular configuration is dictated by the host
computer by means of instructions downloaded to the interface and interpreted
appropriately by the Instruction Register/Controller block shown. Since both
data and instructions will be on the same bus between the host and the
interface, a Decoder circuit is required to decide whether the word on the bus is
a data word or an instruction and to route the byte to the corresponding
destination.
Figure 2-1: Interface Block Diagram (blocks: Host; Input FIFO; Output FIFO;
Instruction Register/Decoder/Controller with RESET; Input Conditioner & Memory;
Input Stager; Parallel-to-Serial Converter; Switch Network; PE Array;
Permutation Network; Output Stager; Output Conditioner & Memory)
The Decoder works by assuming that any byte on the bus is data unless
it is the all 0 byte. If it is the all 0 byte, then the next byte is an instruction
unless the next byte is also all 0, in which case the data is all 0. Obviously,
then, the all 0 byte may not be used as an instruction. The Decoder also
responds to a RESET instruction to configure the interface to a pre-determined
format. This ensures that the host computer can always regain control of the
interface. A state diagram of this machine is drawn in figure 2-2.
Transition labels in the state diagram: Pwr, Eq, -Eq, P.
Eq: Byte matches all 0.
P: Panic, byte matches master reset instruction.
Enable Data FIFO in state A.
Enable Instruction Register in state C.
Enable Master Reset in state R.
Figure 2-2: Decoder State Machine
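The byte-stuffing rule used by the Decoder can be sketched in software as follows. This is only an illustration under stated assumptions: the reset opcode value `RESET_CODE` is hypothetical (the text gives no value), and we assume the reset instruction arrives like any other instruction, that is, preceded by the all 0 byte. The state diagram may allow the panic transition from other states as well.

```python
ESCAPE = 0x00       # the all 0 byte, as described in the text
RESET_CODE = 0xFF   # hypothetical master-reset opcode, not given in the source

def decode(byte_stream):
    """Classify each byte from the host as data, instruction, or reset."""
    events = []
    escaped = False  # True after a lone all 0 byte has been seen
    for byte in byte_stream:
        if not escaped:
            if byte == ESCAPE:
                escaped = True                   # decide on the next byte
            else:
                events.append(("data", byte))    # ordinary data byte
        else:
            escaped = False
            if byte == ESCAPE:
                events.append(("data", ESCAPE))  # 00 00 means data byte 0
            elif byte == RESET_CODE:
                events.append(("reset", byte))   # assumed reset encoding
            else:
                events.append(("instruction", byte))
    return events
```

For example, the stream `41 00 00 00 07` would be read as data 41, data 00, and instruction 07.
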
The Decoder and the Instruction Register/Controller will not be discussed
in detail here, but the Instruction Register/Controller could well consist of a
programmable microcontroller. A specific design of such a controller will be
presented in the next chapter. The rest of the subsystems comprising the
interface will now be examined in depth in the following sections.
2.3 Input Stager and RAM
The Input Stager, together with the RAM (within the Conditioner &
Memory block, which will be discussed in section 2.7), allows the input data to
be permuted from row to column form for up to N_A rows at a time (where N_A
is the maximum number of distinct data lines to the array). That is, the first
word stored in RAM may not be the first word of data sent by the host but,
rather, the first bit of D (D ≤ N_A) consecutive words; the second word in
RAM is made up of the second bit of these D words; and so on. Figure 2-3
illustrates this function (showing only the first three words of the host and
RAM) for N_A = D = 8.
Words from HOST:
  c7 c6 c5 c4 c3 c2 c1 c0
  b7 b6 b5 b4 b3 b2 b1 b0
  a7 a6 a5 a4 a3 a2 a1 a0
Words in RAM:
  h2 g2 f2 e2 d2 c2 b2 a2
  h1 g1 f1 e1 d1 c1 b1 a1
  h0 g0 f0 e0 d0 c0 b0 a0
Figure 2-3: Row-to-Column Permuting
This configuration (first bits in first word, second bits in next word, etc.) is
necessary to execute bit slice oriented tasks in the array of PEs. If no row-to-
column permuting stager existed, the first bits of each word would have to be
accessed sequentially (only one memory word can be accessed in a given clock
period). Since the Input Stager, once initially filled, has a throughput of unity,
it represents a vast increase in speed over sequential access.
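The row-to-column permutation of figure 2-3 is, in effect, a bit-matrix transpose. A minimal software sketch is given below; the function name and the convention that the first host word supplies the least significant bit of each RAM word are our assumptions, chosen to match the ordering drawn in the figure.

```python
def rows_to_columns(host_words, width):
    """Return RAM words where word k holds bit k of every host word.

    host_words[j] contributes bit position j of each RAM word, so the
    first host word (a) ends up in the right-most position, as in the figure.
    """
    ram_words = []
    for k in range(width):
        word = 0
        for j, host_word in enumerate(host_words):
            word |= ((host_word >> k) & 1) << j
        ram_words.append(word)
    return ram_words
```

When the number of host words equals the word width, applying the function twice returns the original words, which is a quick way to check the permutation.
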
Figure 2-4 shows an Input Stager that would be used with an N-bit RAM.
It consists of N^2 cells (arranged in an N by N square) each with a flip-flop and
a 2:1 multiplexer as shown in figure 2-4. There are N 2:1 "collection
multiplexers" that lead data to the Parallel-to-Serial Converter. Operation of
this circuit is straightforward: data enters the circuit from the top and
propagates down one row per clock with permuted data being collected from the
bottom row until all the data (up to N_A words) has entered the circuit. At
this time, data flow changes direction (by changing the common select line of
each 2:1 multiplexer) and will now enter from the left edge and propagate
horizontally. Permuted data is now collected at the right edge of the circuit.
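The alternating down/across operation can be modelled as below. This is a behavioural sketch only, not the circuit of figure 2-4: we assume blocks of exactly N words, one word per clock, a direction change after every N clocks, and one extra block of zeros to flush the last block out. The indexing convention (row 0 at the top, column 0 at the left) is also our own.

```python
def input_stager(words, n):
    """Simulate an n x n stager; words is a list of n-bit words (lists of bits).

    Returns one permuted word per input word. Output t of a block holds,
    at position j, bit (n-1-t) of word (n-1-j) of that block.
    """
    grid = [[0] * n for _ in range(n)]
    outputs = []
    down = True
    stream = list(words) + [[0] * n for _ in range(n)]  # zeros flush the pipeline
    for clock, word in enumerate(stream):
        if clock > 0 and clock % n == 0:
            down = not down                    # flip the common select line
        if down:
            outputs.append(grid[n - 1][:])     # bottom row leaves the array
            grid = [list(word)] + grid[:-1]    # every row moves down one
        else:
            outputs.append([row[n - 1] for row in grid])  # right column leaves
            grid = [[word[r]] + grid[r][:-1] for r in range(n)]
    return outputs[n:]                         # first n outputs are pipeline fill
```

After the initial fill of N clocks, one permuted word leaves the array on every clock while the next block is loaded in the other direction, which is the unity throughput referred to above.
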
2.4 Switch Network 
Data stored in the Input Memory is transferred to the array through a
generalized Parallel-to-Serial Converter and the Switch Network. These two
subsystems stage the data to the format required by the array. In addition, as
explained in section 2.3, the Input Stager also participates in the data staging
operation. Of these two blocks, the Switch Network is a conceptually simpler
circuit. Its function is to route bits from any output line of the Parallel-to-Serial
Converter to the appropriate row(s) of the array of PEs. This is
accomplished by simply providing R N_A:1 multiplexers, where R is the number
of rows of the PE array (one multiplexer per row) and N_A is the number of
output lines from the Parallel-to-Serial Converter of the next section. This
configuration allows maximum flexibility: any bit can be routed to any row,
including multiple destinations.
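Functionally, the Switch Network is one selection per array row. The sketch below illustrates this; the names `p2s_lines` and `row_select` are ours, with `row_select[r]` standing for the select code applied to the multiplexer of row r.

```python
def switch_network(p2s_lines, row_select):
    """Route converter output lines to PE rows.

    p2s_lines: the N_A bits currently on the Parallel-to-Serial outputs.
    row_select: for each of the R rows, the index of the line it listens to.
    """
    return [p2s_lines[line] for line in row_select]
```

Because each row has its own multiplexer, the same line index may appear several times in `row_select`, which gives the multiple-destination routing mentioned above.
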
Figure 2-4: Input Stager Circuit (each Input Stager Cell has Down and Across
inputs to a 2:1 multiplexer controlled by Sel, feeding a D flip-flop; the
collection multiplexers drive Output)
2.5 Generalized Parallel-to-Serial Converter
Data that is stored in the Input Memory in parallel fashion must be
converted to some serial format before being presented to the array of PEs.
Given that a word in memory is of the form a7 a6 a5 a4 a3 a2 a1 a0 (for an
N_A = 8 bit word), the following output formats must be treated. These forms
are shown in figure 2-5.
All bits to one row:
  a7 a6 a5 a4 a3 a2 a1 a0
Bits to two rows:
  a6 a4 a2 a0
  a7 a5 a3 a1
Bits to four rows:
  a4 a0
  a5 a1
  a6 a2
  a7 a3
Bits to eight rows:
  a0
  a1
  a2
  a3
  a4
  a5
  a6
  a7
Figure 2-5: Output Formats
These formats are achieved by reading the Input Memory word-by-word
into a shift register and shifting the bits out. There are essentially two choices
for implementing this process which will be referred to as:
1. Simple-Shift/Complex-Load and
2. Simple-Load/Complex-Shift.
These two approaches will now be discussed and compared in terms of their
hardware complexity. The complexity of a boolean expression is equal to the
number of its terms times the size of each. Thus the complexity of an n:1
multiplexer is
(# of terms)*(size of each term) = n*(log2 n + 1)
Thus, since a 4:1 multiplexer's expression has four terms of three literals each
(two select literals and one data literal), its complexity is 12.
The concept of simple-shift/complex-load is that, when in shift mode, bits
proceed one cell at a time toward their output destinations. That is, if a bit is
currently in cell c_i then it will next be in cell c_(i-1). The loading of the register
is arranged to achieve the desired output staging. Figure 2-6 shows an 8-bit
simple-shift/complex-load shift register. Notice that the output is from different
cells of the shift register depending on the format of the output (number of
bits/clock). For instance, if the output is two bits per clock (even indices on
one line, odd indices on another) then the output is from cells 0 (even indices)
and 4 (odd indices). If the output format is four bits per clock then cells 0, 2,
4, and 6 generate the output.
The complexity of the simple-shift/complex-load form can be calculated as
follows (for the example of N=8):
  6 4:1 multiplexers: 6*(4*(log2 4 + 1)) = 72
  1 2:1 multiplexer:  1*(2*(log2 2 + 1)) =  4
  Total:                                   76
  8 D-type Flip Flops
Figure 2-6: Simple-Shift/Complex-Load
A simple-load/complex-shift configuration (figure 2-7) is different in that
the parallel data is always loaded into the same cell of the shift register: data
bit a_i is loaded into cell c_i. However, cell c_i is not necessarily transferred into
c_(i-1) when in the shift mode. For instance, if the output format is two bits per
clock, then c_i is transferred to c_(i-2). To generalize, if the format is l bits per
clock then c_i is transferred to c_(i-l) at the clock edge. This parallel-to-serial
conversion technique will generate the l output bits in the right-most l cells of
the shift register. The complexity of an 8-bit simple-load/complex-shift
converter is:
  4 4:1 multiplexers: 4*(4*(log2 4 + 1)) = 48
  2 3:1 multiplexers: 2*(3*(log2 3 + 1)) ≈ 16
  1 2:1 multiplexer:  1*(2*(log2 2 + 1)) =  4
  Total:                                   68
  8 D-type Flip Flops
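The two complexity totals can be reproduced with a few lines of code. The sketch below assumes the measure n*(log2 n + 1) for an n:1 multiplexer and rounds each line of the tally to the nearest integer, which is our reading of how the 3:1 multiplexer entry comes to 16; the function names are ours.

```python
from math import log2

def mux_complexity(n):
    """n terms, each with log2(n) select literals plus one data literal."""
    return n * (log2(n) + 1)

def converter_complexity(mux_counts):
    """mux_counts maps a multiplexer size n to the number of n:1 multiplexers."""
    return sum(round(count * mux_complexity(n)) for n, count in mux_counts.items())

simple_shift_complex_load = converter_complexity({4: 6, 2: 1})
simple_load_complex_shift = converter_complexity({4: 4, 3: 2, 2: 1})
print(simple_shift_complex_load, simple_load_complex_shift)
```
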
Obviously, there is a reduction in complexity with the simple-
load/complex-shift approach and this is our method of choice for parallel-to-
serial conversion.
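The behaviour of the simple-load/complex-shift register can be sketched as follows, under the assumptions that bit a_i is loaded into cell c_i, that the outputs are taken from the right-most l cells, and that vacated cells are simply left empty (`None` here). The function name is ours.

```python
def simple_load_complex_shift(word_bits, l):
    """Return, clock by clock, the l bits presented to the array.

    word_bits[i] is a_i, loaded into cell c_i; on each clock the contents
    of c_i move to c_(i-l), so cells c_0..c_(l-1) always hold the output.
    """
    cells = list(word_bits)
    outputs = []
    for _ in range(len(cells) // l):
        outputs.append(cells[:l])            # right-most l cells drive the rows
        cells = cells[l:] + [None] * l       # shift every cell by l positions
    return outputs
```

With an 8-bit word and l = 2, the first clock delivers a0 and a1, the second a2 and a3, and so on, which agrees with the two-row format of figure 2-5.
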
2.6 Accessing Results 
2.6.1 Overview 
Data results generated by the array are available one column (slice) at a
time. This column is the right-most PE of each row of the array. Just as we
assume that input data to the array will be 2^n bits per clock (n ≥ 0), we will
allow data output from the array only in groups that are powers of two. In
other words, serial data that is to be grouped together as the same word in
memory must be outputted one bit per clock (one row), two bits per clock (two
rows), four bits per clock (four rows), etc., up to N=2^n bits per clock, where N
is the word size of the array.
Figure 2-7: Simple-Load/Complex-Shift
Figure 2-8 depicts how results are captured from the array of PEs. Since
the array is of row size M and the word size is N, at most N of the M rows
may be accessed at any one time.
M may be greater than N, necessitating an M-to-N multiplexer to select which
rows will be accessed (fewer than N rows may be active but this poses no
problem).
Figure 2-8: Accessing Results (PE Array, Permutation Network, Output Stager)
The Permutation Network can re-order the N input lines (r_0, r_1, ..., r_(N-1)) to
all possible permutations at the output. The need for this network will be
subsequently explained.
The Output Stager circuit takes as input the N lines of output from the
Permutation Network (which is basically serial in nature) and outputs an N-bit
word at a rate up to one word per clock (when there is active data on N
lines). The Output Stager is modular in design, with each module consisting of
a serial-to-parallel converter, N 2:1 multiplexers, and an N-bit latch. All
components of the data access circuit will now be discussed in more depth.
2.6.2 Permutation Network 
The Permutation Network consists of q butterfly networks, where q is the
smallest integer satisfying 2^q ≥ N!. For N=8, N!=40,320, which means that q=16
butterfly networks would be needed. This can be implemented as four stages of
four networks each. Figure 2-9 shows an example of how butterfly networks are
used to implement a Permutation Network for N=4 (4!=24, q=5).
Figure 2-9: Permutation Network, N=4 (three stages with 2, 2, and 1 control
lines respectively)
Each butterfly network consists of two 2:1 multiplexers sharing a common select
line (see figure 2-10). Each butterfly network has a unique control line so there
are q control lines to the Permutation Network.
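The count of butterfly networks and the behaviour of a single butterfly can be sketched as below. The helper names are ours, and the code only evaluates the counting rule 2^q ≥ N! stated above; it does not construct the wiring of figure 2-9.

```python
from math import factorial

def butterflies_needed(n):
    """Smallest q with 2**q >= n!, i.e. one control line per butterfly."""
    return (factorial(n) - 1).bit_length()

def butterfly(a, b, control):
    """Two 2:1 multiplexers with a shared select: pass straight or swap."""
    return (b, a) if control else (a, b)
```

For N=8 this gives 16 and for N=4 it gives 5, matching the figures quoted in the text.
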
Figure 2-10: Butterfly Network (inputs a and b, outputs x and y; the two MUXes
share the Control select line)
2.6.3 Output Stager
Figure 2-11 shows one of the modules used to construct the Output Stager
circuit. There are N=2^n modules making up the Output Stager circuit.
Figure 2-11: Output Stager Module (inputs A_in and B_in; Serial-to-Parallel
Converter, MUX, and N-bit Latch)
The basic operation is that data from the Permutation Network, which is
primarily serial in nature, is clocked into a module's shift register. When the
shift register fills up it is dumped into that module's latch. The data byte in
the latch then propagates toward the output by being transferred to the next
higher (lower index) module with each clock cycle until it reaches the top
(output) module.
These modules are not strictly identical: the serial-to-parallel (shift
register) converters are of different degrees of flexibility depending upon the shift
register's position relative to the final output latch. This is done to reduce the
total hardware complexity without affecting the system's flexibility or
expandability. If desired, all the shift registers can be made identical. Modules
are interconnected as shown in figure 2-12 with D_out of lower modules connected
to B_in of the module directly above it. A_in lines to all modules are from the
Permutation Network.
The Output Stager must be capable of handling output data generated by
the array in formats analogous to those at the array input. That is, data
belonging to one task (that which should be stored together) may be generated
from only one row (one bit per clock), from two rows (two bits per clock), etc.
up to 2^n rows (2^n bits per clock), where 2^n ≤ N. In addition, the Output Stager
should be capable of servicing more than one task being executed concurrently
by the array, provided the total number of bits per clock generated by all the
tasks does not exceed N.
Figure 2-13 shows the Output Stager for the case of N=2^n=8. When
there is one task producing eight bits of output every clock, the top module
(with 8 input lines) is the only module utilized and is dumped into latch_0 at
every clock. At the other extreme, if eight separate tasks are running
concurrently and each is generating one bit of output per clock then eight
modules are active and are being dumped into their respective latches every
eight cycles.
Figure 2-12: Output Stager (the Permutation Network feeds Module_0 through
Module_(N-1); Module_0 drives the Output)
Figure 2-13: Output Stager, N=8
The question that then arises is: "If the total number of bits per clock generated does not exceed N, can modules always be assigned such that, for any combination of tasks, all data from those tasks can be collected?" If so, then maximum efficiency is achieved. Since data in the modules' latches propagates upward one module at a time, there is the possibility of "data collision" if a data word from a lower module_{i+1} arrives at module_i at the same time that shift register_i is dumping collected data into latch_i.
Consider as an example a system with N = 8 and two tasks being executed concurrently, each producing four bits of output per clock. Thus there are 8 bits of output per clock being generated by the two tasks and, since this does not exceed the word size (N = 8, in this example), all data should be collectible. Since each task produces four bits per clock, each task will fill an 8-bit shift register every other clock. Therefore, each shift register will have to be dumped into its latch every other clock. Clearly, then, data from these two tasks should not be routed to modules that are separated by one module (module_0 and module_2, for example). If they are, then when the data originally collected in module_2 propagates to module_0 (2 clock cycles), the shift register of module_0 will be dumping data to its latch and data collision will result. This situation is remedied if adjacent modules (e.g., module_0 and module_1) are scheduled. Actually, data collision in this example will result any time the two tasks' outputs are scheduled for modules whose distance (difference in modules' indices) is a multiple of two, and no collision will result otherwise.
Obviously, to achieve maximum efficiency (the ability to collect N bits of output per clock), proper assignment of tasks to modules must be employed. Following is an algorithm for scheduling such that, for any combination of concurrent tasks generating no more than N bits of total output per clock, an assignment of tasks to modules results such that all bits can be collected without data collision.
2.6.4 Task Scheduling 
Define the weight of a task, w, to be the number of bits per clock generated by that task. We will allow w to take on only values that are powers of 2 (1, 2, 4, 8, etc.). The following algorithm is a means of scheduling tasks to the Output Stager modules in such a way that data collision, as previously described, will not occur.
Task Output Scheduling Algorithm
1. Choose S, the set of available indices, to be {0, 1, 2, ..., N-1}, and let W be the set of weights of the tasks to be scheduled.
2. Let s be the minimum element of S and w the maximum element of W.
3. Assign the task of weight w to module s.
4. Delete w from W and all elements j from S where $j = s + k \cdot (N/w)$, $k = 0, 1, \ldots, w-1$.
5. If set W is not empty then return to step 2. Else the required output scheduling is complete.
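The five steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `schedule_tasks`, the use of a dictionary for the result, and the input validation are not part of the thesis; the loop body follows steps 2 through 4 directly.

```python
def schedule_tasks(weights, n_modules):
    """Assign each task weight to an Output Stager module index.

    weights   -- bits per clock for each task; each must be a power of two
    n_modules -- N, the number of modules (a power of two)
    Returns a dict {module_index: weight}.
    """
    for w in weights:
        if w < 1 or w & (w - 1) or w > n_modules:
            raise ValueError(f"weight {w} is not a power of two in 1..N")
    if sum(weights) > n_modules:
        raise ValueError("total weight exceeds N")

    available = set(range(n_modules))          # step 1: the set S
    remaining = sorted(weights, reverse=True)  # W, largest weight first
    assignment = {}
    for w in remaining:                        # step 2: largest remaining w ...
        s = min(available)                     # ... and smallest available index s
        assignment[s] = w                      # step 3: task of weight w -> module s
        stride = n_modules // w                # step 4: delete every j = s + k*(N/w)
        available -= {j for j in available if (j - s) % stride == 0}
    return assignment                          # step 5: W is exhausted


print(schedule_tasks([4, 2, 1, 1], 8))
```

Sorting the weights once in decreasing order is equivalent to repeatedly picking the maximum element of W, and keeps the sketch short.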
Before presenting the proof of collision avoidance, the algorithm will first be illustrated with the following example. Consider a case where N = 8 and the tasks to be scheduled are: task_1, producing four bits per clock; task_2, producing two bits per clock; task_3 and task_4, each producing one bit per clock. Therefore, W = {4, 2, 1, 1} and S = {0, 1, 2, 3, 4, 5, 6, 7}. Tasks are then assigned to modules as follows:
1. The largest w is 4 and the smallest available index is 0. Hence assign task_1 to module_0. Eliminate indices 0, 2, 4, and 6, yielding W = {2, 1, 1} and S = {1, 3, 5, 7}.
2. Schedule task_2 to module_1 and eliminate indices 1 and 5. Now W = {1, 1} and S = {3, 7}.
3. Schedule task_3 to module_3 and eliminate index 3. This leaves W = {1} and S = {7}.
4. Schedule task_4 to module_7. This concludes the task scheduling.
Thus, the modules used for collecting data in this example are: module_0, module_1, module_3, and module_7. The remaining modules serve only as links for data word propagation. The reader can convince himself that there are no data collisions with this scheduling.
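One way to check an assignment for data collisions is to simulate the chain of latches clock by clock. The sketch below assumes that every assigned module starts collecting at t = 0 and therefore dumps its shift register at times that are multiples of N/w, and that words move one module toward module_0 per clock; the function name `find_collision` and the tuple used to tag words are illustrative choices, not taken from the thesis.

```python
def find_collision(assignment, n_modules, clocks=None):
    """Simulate the latch chain; return (time, module) of the first collision, or None.

    assignment -- {module_index: weight}; a module of weight w dumps its
                  shift register into its latch every N/w clocks (t = 0, N/w, ...)
    """
    clocks = clocks if clocks is not None else 4 * n_modules
    latches = [None] * n_modules               # word currently held by each latch
    for t in range(clocks):
        # Propagation: each latch takes the word held by the module below it;
        # the word that was in module 0 leaves the Output Stager.
        latches = latches[1:] + [None]
        for i, w in assignment.items():
            if t % (n_modules // w) == 0:      # module i dumps its collected data
                if latches[i] is not None:     # a propagating word arrives at the same time
                    return t, i
                latches[i] = (i, t)            # tag the word with origin and load time
    return None


print(find_collision({0: 4, 1: 2, 3: 1, 7: 1}, 8))   # schedule from the example
print(find_collision({0: 4, 2: 4}, 8))               # two weight-4 tasks two modules apart
```

The second call reproduces the earlier N = 8 discussion: a word collected in module_2 reaches module_0 two clocks later, exactly when module_0 dumps again.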
We now prove that the algorithm described above yields a task assignment with no collisions.
Theorem 1: The Task Output Scheduling Algorithm described earlier gives a schedule to avoid collisions between outputs of tasks.
Proof: Let module_i and module_j be two arbitrary modules. Assume (without loss of generality) that i < j. Data words are collected in module_j at times t which are multiples of $(N/w_j)$. These words then propagate through the lower order modules and arrive at module_i at times t given by
$t \equiv (j-i) \bmod (N/w_j)$.
However, module_i collects its own data at times t' such that
$t' \equiv 0 \bmod (N/w_i)$.
In order to avoid conflicts between the collected data in module_i and the propagating data from module_j, one must satisfy the inequality
$t \ne t'$.
Since i < j, $w_j \mid w_i$, or $(N/w_i) \mid (N/w_j)$. Thus, to avoid collision, one should have
$(N/w_i) \nmid (j-i)$
or $j \not\equiv i \bmod (N/w_i)$.
However, according to the scheduling algorithm, j is picked from the set of indices which has had indices of the type $i \bmod (N/w_i)$ already removed. Hence, $j \not\equiv i \bmod (N/w_i)$ and collision between data from module_i and module_j is avoided. Q.E.D.
The proof of Theorem 1 indicates that there would be a collision between data outputs of module_i and module_j if, and only if,
$(N/\max\{w_i, w_j\}) \mid (j-i)$.
This is avoided by proper association of the modules with data weights in the task assignment algorithm. The next obvious question is "Can all tasks' output be collected using the scheduling algorithm described?" The following theorem answers that question.
Theorem 2: (Algorithm Coverage) The task scheduling algorithm presented earlier can schedule every available task provided that $\sum_{w \in W} w \le N$.
Proof: The required result is proved by induction over the cardinality of set W. We show that S is non-empty as long as W is non-empty and, therefore, at any stage one can always associate the smallest $s \in S$ to the largest $w \in W$. In particular, we show that $\sum_{w \in W} w \le |S|$ at any stage of the algorithm. This is true at the starting stage by assumption of the theorem. Suppose it is true at a particular stage and let $\Omega$ be the largest element in W at the time. According to the algorithm, assign to $\Omega$ the smallest index s available in S (step 3 of the algorithm). This then removes from S all integers of the form $\{s + k \cdot (N/\Omega)\}$ (step 4). But there can be only $\Omega$ such distinct integers (for $k = 0, 1, \ldots, \Omega-1$) since each of them must be less than N. Thus |S| decreases at most by $\Omega$.
However, since $\Omega$ is now dropped from W, $\sum_{w \in W} w$ decreases by $\Omega$ at the next stage. Note that the assignment of $\Omega$ to s is compatible with any future assignment of any remaining $j \in S$ to a remaining $\Omega' \in W$ because non-compatibility (i.e., collision) between module_s of weight $\Omega$ and module_j of weight $\Omega'$ would imply that
$(N/\max\{\Omega, \Omega'\}) \mid (j-s)$.
But $\max\{\Omega, \Omega'\}$ is $\Omega$ and this would mean $j \equiv s \bmod (N/\Omega)$. Since j belongs to a set of indices obtained by dropping all indices congruent to s mod $(N/\Omega)$, the non-compatibility is avoided. Q.E.D.
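For small N the two theorems can also be checked exhaustively. The sketch below enumerates every collection of power-of-two weights whose total does not exceed N, runs the scheduling algorithm, and applies the pairwise collision condition from the proof of Theorem 1. The helper names (`schedule`, `weight_sets`, `collides`, `check`) are illustrative and not from the thesis, and a brute-force check of a few sizes is of course no substitute for the proofs.

```python
from itertools import combinations


def schedule(weights, n):
    """Task Output Scheduling Algorithm; returns {module: weight}, or None if S runs out."""
    free, out = set(range(n)), {}
    for w in sorted(weights, reverse=True):
        if not free:
            return None
        s = min(free)
        out[s] = w
        free -= {j for j in free if (j - s) % (n // w) == 0}
    return out


def weight_sets(n, largest=None):
    """Yield every non-increasing list of power-of-two weights with total <= n."""
    largest = n if largest is None else largest
    yield []
    w = 1
    while w <= min(n, largest):
        for rest in weight_sets(n - w, w):
            yield [w] + rest
        w *= 2


def collides(i, wi, j, wj, n):
    """Collision condition: (N / max(w_i, w_j)) divides (j - i)."""
    return (j - i) % (n // max(wi, wj)) == 0


def check(n):
    """True if every admissible weight set is fully scheduled without collisions."""
    for weights in weight_sets(n):
        assignment = schedule(weights, n)
        if assignment is None or len(assignment) != len(weights):
            return False
        for (i, wi), (j, wj) in combinations(assignment.items(), 2):
            if collides(i, wi, j, wj, n):
                return False
    return True


print(check(8), check(16))
```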
We next consider the problem of minimizing the complexity of each module. In particular, we attempt to design modules with the minimum number of data input lines which will still support all possible task weight sets W. A non-minimal solution is to connect each output line from the Permutation Network to each module. This is undesirable since, as we will show later, the complexity of a module is dependent upon the number of its input lines. For clarity, the solution to the problem will first be stated, followed by a proof of its minimality.
The Output Stager consists of $N = 2^n$ modules labeled with indices 0 through N-1. Each of the modules accepts a certain number of output lines from the Permutation Network. Let $M_i$ denote the set of indices of the lines entering module_i. These sets are chosen in the very specific manner indicated below for reasons that will become clear later.
Data Collection Network: Each $M_i$ starts with the bit-reversed i and uses $N/2^{\lceil \log_2(i+1) \rceil}$ consecutive integers.
Thus, the first set, $M_0$, has N elements, the second set N/2 elements, the third and fourth sets N/4 elements, and so on until the final N/2 sets with one element each. For example, if N = 8 then

Set                               Index   First elem.   No. of elems.
M_0 = {0, 1, 2, 3, 4, 5, 6, 7}    000     000           8
M_1 = {4, 5, 6, 7}                001     100           4
M_2 = {2, 3}                      010     010           2
M_3 = {6, 7}                      011     110           2
M_4 = {1}                         100     001           1
M_5 = {5}                         101     101           1
M_6 = {3}                         110     011           1
M_7 = {7}                         111     111           1

We now prove that the Data Collection Network presented here is sufficient and minimal.
Theorem 3: (Sufficiency of Data Collection Network) The Data Collection Network presented earlier is sufficient to collect data from any set W of tasks (of weights w) provided $\sum_{w \in W} w \le N$.
Proof: In order to show the sufficiency of the data collection network we first prove that for any arbitrary weight distribution set, W, every module can be configured to collect data on distinct output lines from the Permutation Network (i.e., the proposed network does indeed provide distinct lines to distinct modules). If this is true, sufficiency follows because for any W with $\sum_{w \in W} w = N$, all tasks together generate N output bits in one clock (if $\sum_{w \in W} w < N$ then add pseudo operations of weight w to satisfy this relation). If each module picks up w distinct bits per clock then the network is sufficient for all the modules together to gather all the N distinct bits per clock. This implies sufficiency of the network.
Thus, to prove sufficiency:
Let $N = 2^n$. Consider two arbitrary modules i and j of the Output Stager such that, expressed as binary strings,
$i = [\underbrace{00 \ldots 0}_{n-p}\ \underbrace{a_{p-1} a_{p-2} \ldots a_1 a_0}_{p}]$
and
$j = [\underbrace{00 \ldots 0}_{n-q}\ \underbrace{b_{q-1} b_{q-2} \ldots b_1 b_0}_{q}]$.
It is now shown that if the Data Collection Network specified earlier is used then module_i and module_j collect distinct bits. Define $\alpha$ and $\beta$ to be bit-reversed i and j, respectively, i.e.
$\alpha = [a_0 a_1 \ldots a_{p-1}\ 0 \ldots 0]$, $\beta = [b_0 b_1 \ldots b_{q-1}\ 0 \ldots 0]$.
We now compute the elements of set $M_i$. Clearly it starts with $\alpha$ (bit-reversed i) and has $N/2^{\lceil \log_2(i+1) \rceil} = 2^n/2^p = 2^{n-p}$ elements. Hence,
$M_i = \{ r \cdot 2^{n-p} + k_i : 0 \le k_i < 2^{n-p} \}$, $r = [a_0 a_1 \ldots a_{p-1}]$.   (2.1)
Similarly,
$M_j = \{ s \cdot 2^{n-q} + k_j : 0 \le k_j < 2^{n-q} \}$, $s = [b_0 b_1 \ldots b_{q-1}]$.   (2.2)
Assume (without loss of generality) that j > i. Then $|M_i| \ge |M_j|$. The same data element picked up by module_i and module_j implies that an element in $M_i$ is the same as an element in $M_j$. That is,
$r \cdot 2^{n-p} + k_i = s \cdot 2^{n-q} + k_j$ for some $k_i$ and $k_j$.
Let q = p + t. Then the same equation can be rewritten as
$s \cdot 2^{n-q} = r \cdot 2^{n-p} + k$, with $k = k_i - k_j$,
or
$s = r \cdot 2^{q-p} + k/2^{n-q} = r \cdot 2^t + c$.
In this last expression,
$c = k/2^{n-q} < 2^t$,
since $k = k_i - k_j \le k_i < 2^{n-p}$ and $2^{n-p}/2^{n-q} = 2^t$.
Thus the last t bit positions in the binary expansion of s would be determined by c and the remaining n-t by r. The n-bit string for s would then take the form:
$s = [00 \ldots 0\ a_0 a_1 \ldots a_{p-1}\ c_{t-1} \ldots c_1 c_0]$.
Comparing this string of s with the string in equation (2.2), one gets that, for collision,
$a_0 = b_0,\ a_1 = b_1,\ \ldots,\ a_{p-1} = b_{p-1}$ and $c_{t-1} = b_p,\ \ldots,\ c_1 = b_{q-2},\ c_0 = b_{q-1}$.
Thus the binary string j may be expressed as
$j = [00 \ldots 0\ c_0 c_1 \ldots c_{t-1}\ b_{p-1} \ldots b_1 b_0]$.
Comparing this with the string for i,
$i = [00 \ldots 0\ b_{p-1} b_{p-2} \ldots b_1 b_0]$,
one gets
$j - i = [00 \ldots 0\ c_0 c_1 \ldots c_{t-1}\ 0 \ldots 0]$ (with p trailing zeros).
However, note that since i < j, $w_i \ge w_j$, and
$\max\{w_i, w_j\} = w_i$.
Thus, a collision implies:
$(N/\max\{w_i, w_j\}) \mid (j-i)$,
or $(N/w_i) \mid (j-i)$,
or $j \equiv i \bmod (N/w_i)$.
However, according to the task scheduling algorithm presented earlier, j is chosen from a set of indices which have already had elements removed that are congruent to i mod $(N/w_i)$. This contradiction shows that elements selected from module_i and module_j are distinct, completing the proof of the sufficiency of the Data Collection Network. Q.E.D.
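The construction of the sets $M_i$ is easy to reproduce programmatically. The sketch below builds each set from the bit-reversed index and the size rule $N/2^{\lceil \log_2(i+1) \rceil}$, and then lists which lines each scheduled module would read. One point is an assumption rather than something stated in the text: a module of weight w is taken to read the first w lines of its $M_i$. The function names are illustrative only.

```python
def bit_reverse(i, n_bits):
    """Reverse the n_bits-bit binary representation of i."""
    return int(format(i, f"0{n_bits}b")[::-1], 2)


def line_set(i, n):
    """M_i: N / 2**ceil(log2(i+1)) consecutive line indices starting at bit-reversed i."""
    n_bits = n.bit_length() - 1           # n = 2**n_bits
    size = n >> i.bit_length()            # i.bit_length() equals ceil(log2(i + 1))
    start = bit_reverse(i, n_bits)
    return list(range(start, start + size))


def lines_used(assignment, n):
    """Lines read by each scheduled module (assumption: the first w lines of M_i)."""
    used = {}
    for i, w in assignment.items():
        m_i = line_set(i, n)
        if w > len(m_i):
            raise ValueError(f"module {i} has too few input lines for weight {w}")
        used[i] = m_i[:w]
    return used


for i in range(8):
    print(i, line_set(i, 8))
print(lines_used({0: 4, 1: 2, 3: 1, 7: 1}, 8))
```

For the schedule of the earlier N = 8 example, the four modules read lines 0 to 3, 4 to 5, 6, and 7 respectively, so all eight lines are covered exactly once under this assumption.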
We will now discuss the measure of complexity alluded to in Theorem 3 which is related to the number of input lines to an Output Stager module. We stated earlier that module complexity increases as the number of its input data lines increases. The variable in the module complexity is all within the module's serial-to-parallel converter, since the multiplexer and latch size are determined by N, and are fixed and constant for all modules.
By way of example, we will show that module complexity increases as the number of input lines, l, increases. Figure 2-14 shows the data collection circuit (serial-to-parallel converter) for word size N = 8 and number of input lines l = 8. For this module, data collection can be one, two, four, or eight bits per clock. Therefore, every cell must be connected to some L_i (to collect eight bits per clock); the four right-most cells to a cell a distance of four to its left (to collect four bits per clock); the six right-most cells to a cell a distance of two to the left (to collect two bits per clock); and the right-most seven cells to the cell to its immediate left (to collect one bit per clock). Thus, for N = 8, l = 8 one has the following:

  4 4:1 multiplexers    4 [4 (log2 4 + 1)] = 48
  2 3:1 multiplexers    2 [3 (log2 3 + 1)] = 16
  1 2:1 multiplexer     1 [2 (log2 2 + 1)] =  4
                                     total   68

for a total complexity of 68 (as defined for a multiplexer in section 2.5). If the number of lines is reduced to four, then the lines in bold italics in figure 2-14 are not necessary because one never collects eight bits per clock and the complexity is reduced to 51 as the right-most four multiplexers are reduced from 4:1 to 3:1 multiplexers. Similar reductions occur as more lines are eliminated.
Figure 2-14: Data Collection Circuit: N = l = 8 (inputs PN0 through PN7)
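The two totals quoted above can be reproduced with a short calculation. The sketch assumes that the complexity of an m:1 multiplexer is m(log2 m + 1), as the tally suggests, and that the total is rounded to the nearest integer; the rounding rule is a guess made to match the quoted figures, and the function names are illustrative.

```python
from math import log2


def mux_complexity(inputs):
    """Complexity of one m:1 multiplexer, taken as m * (log2(m) + 1)."""
    return inputs * (log2(inputs) + 1)


def circuit_complexity(mux_counts):
    """mux_counts maps multiplexer size m to the number of m:1 multiplexers."""
    return sum(count * mux_complexity(m) for m, count in mux_counts.items())


eight_lines = {4: 4, 3: 2, 2: 1}   # l = 8: four 4:1, two 3:1, one 2:1
four_lines = {3: 6, 2: 1}          # l = 4: the four 4:1 multiplexers become 3:1

print(round(circuit_complexity(eight_lines)))
print(round(circuit_complexity(four_lines)))
```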
Since we have demonstrated the desire to keep the number of input lines to a module to a minimum, we will show that the Output Stager circuit presented is indeed minimal. Notice that the Output Stager circuit consists of modules of the following type:
1 N-input module
1 N/2-input module
2 N/4-input modules
4 N/8-input modules
...
N/2 1-input modules
for a total of N modules altogether. Clearly, at least one N-input module is necessary if there is a task which produces N bits of output per clock. If there are two N/2-bit output tasks (total N bits per clock) then there must be at least two modules of size N/2 or greater. We use the one N-bit module and one N/2-bit module. In general, if there are v concurrent tasks producing N/v bits per clock then there must be at least v modules with $l \ge N/v$. For all cases, the Output Stager circuit has exactly v modules meeting the necessary requirements (the minimum number necessary), giving rise to the following theorem.
Theorem 4: (Minimality of Data Collection Network) The Data Collection Network, consisting of N Output Stager modules, is the minimal network capable of collecting data from any set W of tasks (of weight w) with $\sum_{w \in W} w \le N$.
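The counting argument behind the theorem can be spot-checked numerically: for v concurrent tasks of weight N/v, exactly v modules should have at least N/v input lines. The sketch below uses the size rule $N/2^{\lceil \log_2(i+1) \rceil}$ for the number of lines entering module i; the function names are illustrative only.

```python
def input_lines(i, n):
    """Number of input lines of module i: N / 2**ceil(log2(i + 1))."""
    return n >> i.bit_length()        # i.bit_length() equals ceil(log2(i + 1))


def modules_with_at_least(lines, n):
    """Count the modules that have at least the given number of input lines."""
    return sum(1 for i in range(n) if input_lines(i, n) >= lines)


n = 16
for v in (1, 2, 4, 8, 16):
    # v concurrent tasks of weight N/v need v modules with at least N/v lines
    print(v, modules_with_at_least(n // v, n))
```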
The need for a Permutation Network to re-order lines in any sequence should now be clear. This flexibility is necessary to route the various tasks to the appropriate Output Stager module for data collection. This allows a minimal Output Stager circuit at the expense of having a Permutation Network, resulting in an overall savings in hardware.
Finally we consider the control of the Output Stager modules. Note that the Permutation Network requires a separate control line for each butterfly multiplexer or $\lceil \log_2(N!) \rceil$. On the other hand, the Output Stager requires N/2 distinct control lines for its N 2-word:1-word multiplexers (one in each module). This is because module_i and module_{i+N/2} ($0 \le i < N/2$) will share the same control line. What this means is that both modules are either propagating a word from the next lower module or loading a collected word from their own serial-to-parallel converters. Using a single control line for two modules is permitted as long as no data collision occurs in module_{i+N/2} as a result. Because of the Task Output Scheduling Algorithm there are three cases that must be considered:
1. neither module is collecting data from the array;
2. module_i is used for data collection but module_{i+N/2} isn't;
3. module_i and module_{i+N/2} are both used for collecting data.
Obviously there is no data collision in the first case since neither module is collecting data (the control line always selects data propagation).
In case 2, if module_i is assigned a task of weight w = 1 then module_{i+N/2} would not be eliminated from consideration for task scheduling and, if it is unused, then all modules with a higher index must also be unused because the Task Output Scheduling Algorithm schedules tasks to the lowest indexed module available. If no modules with index greater than i+N/2 are used there can be no data collision in module_{i+N/2}.
If module_i has been assigned a task with weight w > 1 then module_{i+N/2} is not used for data collection but it will still be loading data from its serial-to-parallel converter whenever module_i is, since the modules use a common control line. If this loading in module_{i+N/2} occurs when a word is attempting to propagate into module_{i+N/2} then data collision exists. However, by the way tasks are assigned to modules, no valid data word can be propagating into module_{i+N/2} at time t (when module_i and module_{i+N/2} are being loaded) because this word would then arrive at module_i at time t + N/2 when it would be loading again. Since the Task Output Scheduling Algorithm precludes this collision, there could have been no valid data propagating into module_{i+N/2} at time t.
For case 3 to exist both module_i and module_{i+N/2} must each be collecting data from tasks of weight w = 1 since, if module_i was collecting a task of weight w > 1, then module_{i+N/2} would not be assigned a task according to the Task Output Scheduling Algorithm. Therefore, both modules load from their respective serial-to-parallel converters once every N clocks and there is no data collision.
Notice that, because of the Task Output Scheduling Algorithm, there exists no case 4 with module_i unused and module_{i+N/2} collecting data since tasks are always assigned to lower indexed modules first and, if module_i had to be eliminated from scheduling consideration because of data collision with a lower indexed module, then module_{i+N/2} necessarily had to be eliminated, too. Remember that, to prevent data collision, modules a distance of N/w (or a multiple thereof) from an assigned module cannot be assigned a task. Thus if module_i is a distance of k(N/w) away from an assigned module then module_{i+N/2} is a distance of (k + w/2)N/w (a multiple of N/w) away, too.
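The shared-control-line argument can be exercised with a small simulation that loads module_i and module_{i+N/2} together and checks that no valid word is ever overwritten. The sketch assumes all tasks start collecting at t = 0 and that a module without an assigned task loads no valid word even though its control line is asserted; the function name `activity_table` and the string encoding of the cells ("L" for load, "P" plus origin for a propagating word) are illustrative choices.

```python
def activity_table(assignment, n, clocks):
    """Return rows[module][t] with 'L<m>', 'P<origin>' or '' for each clock.

    assignment -- {module_index: weight} produced by the scheduling algorithm
    Modules i and i + N/2 share one control line, so they always load together.
    """
    half = n // 2
    rows = [[""] * clocks for _ in range(n)]
    latches = [None] * n                       # origin module of the valid word held, if any
    for t in range(clocks):
        latches = latches[1:] + [None]         # words move one module toward module 0
        loading = set()
        for i, w in assignment.items():
            if t % (n // w) == 0:              # module i dumps its shift register ...
                loading |= {i, (i + half) % n} # ... and so does its control-line partner
        for m in sorted(loading):
            if latches[m] is not None:         # a valid propagating word would be lost
                raise RuntimeError(f"collision in module {m} at t={t}")
            if m in assignment and t % (n // assignment[m]) == 0:
                latches[m] = m                 # only assigned modules load valid data
            rows[m][t] = f"L{m}"
        for m, origin in enumerate(latches):
            if origin is not None and not rows[m][t]:
                rows[m][t] = f"P{origin}"
    return rows


table = activity_table({0: 4, 1: 1, 3: 1, 5: 1, 7: 1}, 8, 9)
for m, row in enumerate(table):
    print(f"module_{m}", " ".join(cell or "--" for cell in row))
```

For N = 8 with one task of weight 4 and four tasks of weight 1, module_4 loads on every even clock along with module_0, while words from module_5 and module_7 pass through it only on odd clocks.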
Figure 2-15 shows an activity table for N = 8 and five tasks of weight 4, 1, 1, 1, and 1. These tasks are assigned (by the Task Output Scheduling Algorithm) respectively to module_0, module_1, module_3, module_5, and module_7. This is an example in which module_2 and module_6 illustrate case 1; module_0 and module_4, case 2; and module_1 and module_5, case 3. In the figure, the presence of an L in a row means that data from the serial-to-parallel converter is entering the module and a P means that a data word is propagating. Remember that when module_i is being loaded, so is module_{i+4} (N/2 = 4, in this example).
Module \ Time    0    1    2    3    4    5    6    7    8
module_0         L0   P1   L0   P3   L0   P5   L0   P7   L0
module_1         L1   -    P3   -    P5   -    P7   -    L1
module_2         -    P3   -    P5   -    P7   -    -    -
module_3         L3   -    P5   -    P7   -    -    -    L3
module_4         L4   P5   L4   P7   L4   -    L4   -    L4
module_5         L5   -    P7   -    -    -    -    -    L5
module_6         -    P7   -    -    -    -    -    -    -
module_7         L7   -    -    -    -    -    -    -    L7
L: module being loaded from its serial-to-parallel converter
P: propagating word
-: no activity
Figure 2-15: Data Collection/Propagation

2.7 Memory Utilization
Overall interface efficiency can be further enhanced by selecting RAM memory size to be large enough to service incoming and outgoing data demands simultaneously. Consider, for example, the Input Conditioner & Memory block shown in figure 2-1 containing a bank of RAM chips which typically are written to by the host computer and read from by the array of PEs. It is true that the Input FIFO and Input Stager are between the host and the RAM, but these units operate at the same data rate as the host, so it is effectively the host that governs the data transfer rate into the RAM bank (which is what interests us now). Similarly, the Parallel-to-Serial Converter is clocked with the same signal as the PE array (the Switch Network is purely combinatorial logic and, therefore, doesn't affect data transfer rates) so it is the PE array that governs the data transfer rate out of the RAM bank. Similar arguments hold for the Output Conditioner & Memory and data transfer rates so that the interface block diagram can be re-drawn, as far as the memories are concerned, as shown in figure 2-16.
Let's focus our attention on the Input Conditioner & Memory and then relate our results to the Output Conditioner & Memory. For the Input, it is likely that data will be read from the memory to the PE array at a much higher rate than will be written to the RAM by the host. Nonetheless, while the array is extracting data from the memory to execute the current task, it makes sense to allow data to be written to the memory from the host as well.
Figure 2-16: Interface Memory Block Diagram (Host, Input RAM, Output RAM, and PE Array; bus widths $N_H$ and $N_A$)
This "simultaneous" read/write capability becomes even more advantageous as the data transfer demands of the host approach those of the array: the memory will be filling at a rate comparable to the rate it is emptying and there will be no "wait" time.
To effect this dual capability, the RAM bank must be of sufficient size (word size, width) to service the array with data as fast as the array needs it and still have idle time (as far as the array is concerned) to service the host's demands to write data to the memory. The following equation determines the number of RAM chips necessary to implement this dual capacity:
$s_R \cdot \dfrac{N_R}{T_R} \ge \dfrac{N_H}{T_H} + \dfrac{N_A}{T_A}$   (2.3)
where
$s_R$ is the number of RAM chips
$T_R$ is the access time of the RAM chips
$N_R$ is the word size of each RAM chip
$T_H$ is the time between words from the host
$N_H$ is the word size of the host (bus)
$T_A$ is the period of the array clock
$N_A$ is the "word size" of the array (the maximum number of lines requesting distinct bits).
This equation is simply stating that the number of RAM chips ($s_R$) times the data rate each chip is capable of must equal or exceed the data rate demands of the host and array combined. We now know how many RAM chips are needed to meet the demands of the system. What needs to be examined is how the reading and writing is to be scheduled, since the writing is done by "cycle stealing" from the reading and since $N_H$, $N_A$, and $N_R$ may not all be equal. Figure 2-17 shows a block diagram of the scheme developed for this.
Figure 2-17: Memory Architecture (Host, $N_H$ bits wide, to Register 1, to Register 2, to the RAM chips of word size $w_R = s_R \cdot N_R$, to Register 4, to Register 5, to the Array, $N_A$ bits wide)
Notice in figure 2-17 that for both the read and write there is two-level buffering. This is done to match word sizes between the host and the RAMs and between the RAMs and the array. Sizes of the four registers are chosen in a specific manner. Since Register 2 and Register 4 communicate with the RAM, each is a multiple of the RAM word size $w_R$ ($w_R = s_R \cdot N_R$) though not necessarily the same size. Define $s_A$ and $s_H$ as follows:
$|\text{Register 2}| = s_H \cdot w_R$
and
$|\text{Register 4}| = s_A \cdot w_R$
The question, then, is how $s_A$ and $s_H$ should be chosen. The answer comes from the fact that the registers are used to match word sizes. Consider the pair Register 1 and Register 2. We know that Register 2 is written to the RAM banks $w_R$ bits at a time. Therefore, both Register 1 and Register 2 should be a length that is a multiple of $w_R$. We also know that Register 1 is filled $N_H$ bits at a time (size of one host word). Therefore, the length of the registers should be a multiple of $N_H$ too:
$N_H \mid w_R \cdot s_H$
or
$s_H = \mathrm{LCM}(N_H, w_R)/w_R$   (2.4)
Applying similar arguments for the array and its word size to Register 4 and Register 5 yields
$s_A = \mathrm{LCM}(N_A, w_R)/w_R$   (2.5)
The registers' functions are conceptually straightforward. Register 1 is filled at a rate of $N_H$ bits (the host's word size) every $T_H$ seconds until it is filled ($s_H \cdot w_R/N_H$ host words). All of Register 1 is then dumped to Register 2. Register 2 is then written to the RAM bank in $s_H$ bytes of size $w_R$ each. Analogously, RAM chips are read into Register 4 in $w_R$-size bytes until Register 4 is full ($s_A$ bytes). Register 4 is then dumped into Register 5 which clocks $s_A \cdot w_R/N_A$ bytes of size $N_A$ each to the array of PEs.
Because the number of RAM chips was selected in such a manner that the RAMs' data handling capability meets or exceeds the combined demands of the host and the array, we know that the data transfer scheme just described is capable of servicing the host and array "simultaneously". That is, whenever the host wishes to send a data byte, there must be room for it in Register 1 and, whenever the array requests a data byte, one must be available in Register 5. The next section describes a method that might be used by a memory controller to effect this read/write scheduling for the memory.
2.7.1 Memory Control
Certain aspects of register loading are known exactly. Register 2 must be loaded with the contents of Register 1 every $(s_H \cdot w_R/N_H) \cdot T_H$ seconds. This is how long it takes the host to fill Register 1. If Register 2 isn't loaded when Register 1 fills, there will be no room in Register 1 when the next data word arrives from the host, resulting in a loss of data. Similarly, Register 4 must be filled at least every $(s_A \cdot w_R/N_A) \cdot T_A$ seconds, since this is how frequently the array empties Register 5. Failure to meet this demand results in the array trying to read data that isn't present.
Read/Write control operation can now be discussed. Arbitrarily give the array higher priority than the host as the default condition. That is, memory will normally be accessed by the array to write to Register 4 (until it is filled) unless the host demands access to the memory. This demand is the result of a situation presented earlier: Register 1 is filled (or will be filled before Register 2 is emptied into the memory) and needs to be dumped into Register 2 to make room for the next word of data from the host. Under these circumstances, the host will have access to write to the memory (remember that all these arguments are presented for data input to the array but are analogous for the output as well). Because the memory was chosen sufficiently large to meet or exceed all data I/O demands, this method of read/write control will work with no access conflicts. Figure 2-18 shows an activity table employing the technique just discussed for the following parameters:
$N_H = 12$, $T_H = 5T$
$N_A = 10$, $T_A = 2T$
$N_R = 8$, $T_R = 3T$
which yields:
$s_R = 3$ from equation (2.3)
$s_H = 1$ from equation (2.4)
$s_A = 5$ from equation (2.5)
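The three values quoted above can be reproduced with a short calculation. The sketch assumes that equation (2.3) compares the combined chip data rate $s_R N_R/T_R$ with the combined demand $N_H/T_H + N_A/T_A$, as the accompanying prose states, that equations (2.4) and (2.5) take the least-common-multiple form $\mathrm{LCM}(N, w_R)/w_R$, and that $T_H = 5T$; the function names are illustrative only.

```python
from fractions import Fraction
from math import ceil, gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def ram_chips(n_h, t_h, n_a, t_a, n_r, t_r):
    """Smallest s_R whose combined data rate covers host plus array demand."""
    demand = Fraction(n_h, t_h) + Fraction(n_a, t_a)   # bits per unit time
    per_chip = Fraction(n_r, t_r)                      # bits per unit time, one chip
    return ceil(demand / per_chip)


def register_multiples(n_h, n_a, w_r):
    """(s_H, s_A): register lengths in RAM words, multiples of host and array word sizes."""
    return lcm(n_h, w_r) // w_r, lcm(n_a, w_r) // w_r


# Times are expressed in units of T.
s_r = ram_chips(n_h=12, t_h=5, n_a=10, t_a=2, n_r=8, t_r=3)
s_h, s_a = register_multiples(12, 10, s_r * 8)
print(s_r, s_h, s_a)
```

Exact rational arithmetic is used so that the ceiling is not affected by floating-point rounding when the required rate is close to a whole number of chips.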
2.7.2 Optimization of Memory Size
We have seen in equation (2.3) an expression for the minimum number of RAMs necessary to service both the host and the array simultaneously. Upon investigation, it was discovered that this number is not necessarily optimal. Since our goal is an efficient interface, hardware reduction is a prime concern. Define complexity here as simply:
$C = w_R (s_A + s_H)$   (2.6)
Calculating complexity for different $s_R$'s, while the other parameters ($N_R$, $N_A$, $N_H$, $T_H$, $T_A$, and $T_R$) are held constant, demonstrated that the optimal number of RAMs is often not the minimum number. Results for two such cases are listed in figure 2-19, where each table begins with the minimum $s_R$ calculated from equation (2.3). Note that if $w_R = \mathrm{LCM}(N_H, N_A)$, then $s_A = s_H = 1$ and all registers have the same width as the memory. In most cases, this also yields minimum complexity.
Figure 2-18: Activity table for Register 1, Register 2, RAM, Register 4, and Register 5 (time axis from 0 to 135 in steps of 5)
N_A = 10   N_H = 8   N_R = 1
T_A = 2    T_H = 8   T_R = [illegible]

s_R   w_R   s_A   s_H   C
31    31    10    8     558
32    32    5     1     192
33    33    10    8     594
34    34    5     4     306
35    35    2     8     348
36    36    5     2     252
37    37    10    8     666
38    38    5     4     342
39    39    10    8     702
40    40    1     1     80
41    41    10    8     738
42    42    5     4     378
43    43    10    8     774

N_A = 10   N_H = 8   N_R = 8
T_A = 2    T_H = 7   T_R = [illegible]

s_R   w_R   s_A   s_H   C
4     32    5     1     192
5     40    1     1     80
6     48    5     1     288
7     56    5     1     336
8     64    5     1     384
9     72    5     1     432
10    80    1     1     160
11    88    5     1     528
12    96    5     1     576
13    104   5     1     624
14    112   5     1     672
15    120   1     1     240
16    128   5     1     768

Figure 2-19: Complexity vs. Memory Size
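The rows of figure 2-19 can be regenerated from the word sizes alone. The sketch below assumes the complexity measure is $C = w_R (s_A + s_H)$, which reproduces nearly all of the values listed in the figure (for $s_R = 35$ in the first table this form gives 350 where the figure lists 348), and uses $\mathrm{LCM}(N, w_R)/w_R = N/\gcd(N, w_R)$ for $s_A$ and $s_H$. The function name `complexity_row` is illustrative.

```python
from math import gcd


def complexity_row(s_r, n_r, n_h, n_a):
    """Return (s_R, w_R, s_A, s_H, C) with C = w_R * (s_A + s_H) (assumed form)."""
    w_r = s_r * n_r
    s_a = n_a // gcd(n_a, w_r)      # equals LCM(N_A, w_R) / w_R
    s_h = n_h // gcd(n_h, w_r)      # equals LCM(N_H, w_R) / w_R
    return s_r, w_r, s_a, s_h, w_r * (s_a + s_h)


# Second table of the figure: N_A = 10, N_H = 8, N_R = 8, s_R from 4 to 16.
rows = [complexity_row(s_r, n_r=8, n_h=8, n_a=10) for s_r in range(4, 17)]
best = min(rows, key=lambda row: row[-1])
for row in rows:
    print(*row)
print("lowest complexity:", best)
```

In this case the minimum number of chips (4) has complexity 192, while one additional chip brings $w_R$ to LCM(8, 10) = 40 and the complexity down to 80.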
Chapter 3
A Specific Interface Design
3.1 Introduction
This chapter is devoted to the design of an interface circuit to be used in an ongoing research program sponsored by Accusort Inc. This design can be thought of as a specific example of the more general (and complex) interface described in the previous chapter. In this particular design, blocks of the general interface designated as the Input and Output Conditioner & Memory, the Input Stager, and the Permutation Network are not employed. The other blocks comprising the general interface have been incorporated in this design but are less complex than their counterparts of the general interface circuit previously described.
In the project under discussion, the task at hand is to provide an interface so that character strings resident in the host computer's memory can be transferred to a string matching module circuit. The host computer is based on an Intel 8085 microprocessor and the matching module is a bit-serial VLSI custom design [8].
The ultimate goal of the system is to be able to reconstruct bar code
labels read by a linear laser scan from packages moving on a conveyor belt. In
cases where the laser sweeps through the entire label in one scan, the
reconstruction is trivial. However, when no single scan sweeps through the
entire label, but only part of it, the label must be reconstructed from the
fragments obtained from each scan. Figure 3-1a depicts a situation where a
scan will sweep through an entire label. Figure 3-1b illustrates a situation
where no one scan will cut through the entire label. There will be several
scans of the label as it moves through the sweep field of the laser, providing
label fragments with overlap. This overlap is caused by two consecutive scans
cutting through some common area of the label.

Figure 3-1: (a) Laser scans entire label. (b) Laser scans only part of label.

The matching module is utilized in the reconstruction of the label by
matching fragments' overlap. Conceptually, the module will compare a new
string (called the A-string) to an existing string (called the X-string) and
attempt to concatenate the A-string to the X-string. The strings are
three-valued (0, 1, W), where 0 and 1 are conventional Boolean logic values and W
(wild card) can be thought of as a don't care. That is, when checking for a
match with W the result is always true, whether comparing 0 and W, 1 and
W, or W and W.
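The three-valued comparison rule can be illustrated with a short sketch. This is only a software model of the rule stated above, not the matching module's circuit; the names WILD, chars_match, and strings_match are hypothetical, and comparing two equal-length strings position by position is an assumption made for illustration.

```python
WILD = "W"  # hypothetical symbol for the wild-card value


def chars_match(a: str, x: str) -> bool:
    """Compare two three-valued characters ('0', '1', or 'W').

    A wild card on either side always matches; otherwise the
    Boolean values must be equal.
    """
    if a == WILD or x == WILD:
        return True
    return a == x


def strings_match(a_string: str, x_string: str) -> bool:
    """Position-by-position comparison of two equal-length strings."""
    if len(a_string) != len(x_string):
        return False
    return all(chars_match(a, x) for a, x in zip(a_string, x_string))


if __name__ == "__main__":
    print(chars_match("0", "W"))        # wild card matches anything
    print(strings_match("01W", "0W1"))  # every position is compatible
```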
The basic interfacing problem is twofold:
1. data is available from the host computer in 8-bit words while
the matching module needs to be loaded in serial form, and
2. the matching module is capable of operating speeds far in excess of
the host's ability to make data available.
For these reasons, a FIFO to hold an entire string of characters (maximum of
48 in the Accusort project) and parallel-to-serial shift registers are employed.
The manner in which this alleviates the timing problem is simple: the host
simply fills the FIFO with all the data as fast as it can and then instructs the
module that the necessary data is now resident in the FIFO. The problem of
parallel-to-serial conversion to load the strings into the matching module is
more complicated.
There are two basic loading modes for the matching module: X-string load
and A-string load. An X-string load is done when an entirely new label
reconstruction process is to commence (there is no current string to be
concatenated) and the initial string (X-string) is to be loaded into the matching
module. In the case of an A-string load, a partially reconstructed X-string is
already resident in the module and this new A-string is to be concatenated to
it. Because of the design of the matching module, an X-string load is
performed by serially loading one character per clock cycle while an A-string
load requires that two characters be loaded per clock cycle. Obviously,
implementing an X-string load is the easier of the two tasks.
Each of the strings consists of 1's and 0's representing wide and narrow
bars and spaces encountered by the laser scans. Difficulty arises based on the
orientation of the label relative to the laser scan. Ideally, one would like the
scan to always begin at one end of the label and proceed toward the other end
(see figure 3-2).

Scan   Bits read
1st    b0 b1 b2 ...
2nd    b4 b5 b6 ...
3rd    b7 b8 b9 ...
Figure 3-2: Good Scan

However, if the label is angled wrong, the bits will be read in reverse order as
shown in figure 3-3. The host computer readily recognizes when a label has
been read in the reverse order and will output the data words to the FIFO in
proper order. However, it would take the host a prohibitively long time to
properly reorder the bits within each byte. Instead, the host signals the
interface hardware that this situation exists through a control signal "Flip" and
the bit juggling is done by the interface.
3.3 Problem Solution
There are four cases of string loading which need to be addressed:
1. Normal X-string load;
2. Flip X-string load;
3. Normal A-string load;
4. Flip A-string load.

Scan   Bits read
1st    b2 b1 b0 ...
2nd    b7 b6 b5 ...
3rd    b10 b9 b8 ...
Figure 3-3: Inside-out Scan

Case 1 is the most straightforward to implement and easiest to understand. It
will therefore be discussed first.
For a normal X-string load, the data is resident in the FIFO as depicted
in figure 3-4 when the loading is to begin.

FIFO:   X-string 8-15
        X-string 0-7
        X_count
Figure 3-4

The first word in the FIFO is X_count, an 8-bit word whose magnitude is the
number of non-wild-card characters in the X-string. The remaining words in
the FIFO are the non-wild-card characters (0 or 1) themselves arranged as in
figure 3-5.

x_{8n+7}  x_{8n+6}  x_{8n+5}  x_{8n+4}  x_{8n+3}  x_{8n+2}  x_{8n+1}  x_{8n}
n = 0, 1, 2, 3, 4, 5
Figure 3-5: Organization of a string in FIFO

Since the X-string is three-valued, there are two X-string input lines to the
matching module: X and W_X. When W_X = 0 the binary value on the X line is
the character of the X-string at that time. When W_X = 1 the character of the
X-string is a wild card at that time, regardless of the value of X. The
three-valued A-string is similarly implemented.
To load a normal X-string resident in the FIFO into the matching
module, the X_count is first loaded into a count-down counter (X_counter)
and then the first word of the X-string is loaded into a shift-right register.
Also, a count-down counter is loaded with 48, the expected reconstructed string
length in the Accusort project. This counter will be referred to as the
48_counter. The shift register, X_counter, and the 48_counter are now all
enabled. Additionally, at every eighth clock pulse, the shift register is loaded
with the next 8 string bits from the FIFO by a control signal from the
microcontroller. W_X is 0 as long as the X_counter is non-zero (indicating there
are still non-wild-card characters). When the X_counter reaches 0 it stops
counting and W_X = 1, indicating that the remaining string characters are
wild-cards. When the 48_counter reaches 0 all 48 characters have been loaded into
the matching module and the load instruction has been executed.
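The normal X-string load just described can be modelled in a few lines of software. The sketch below is only a behavioural illustration, not the hardware: the function name x_string_load is hypothetical, FIFO words are represented as 8-bit integers whose least significant bit holds x_{8n} (following figure 3-5), and the shift-right register is assumed to emit its right-most bit first.

```python
def x_string_load(x_count, fifo_words, total_length=48):
    """Return the (X, W_X) pair presented to the module on each clock.

    x_count      -- number of non-wild-card characters (first FIFO word)
    fifo_words   -- the remaining FIFO words as 8-bit integers
    total_length -- value loaded into the 48_counter
    """
    x_counter = x_count
    stream = []
    for clock in range(total_length):
        # A new word is loaded into the shift register every eighth clock.
        word_index, bit_index = divmod(clock, 8)
        word = fifo_words[word_index] if word_index < len(fifo_words) else 0
        x_bit = (word >> bit_index) & 1  # right-most bit leaves first
        if x_counter > 0:
            w_x = 0          # still a non-wild-card character
            x_counter -= 1
        else:
            w_x = 1          # X_counter exhausted: wild card
        stream.append((x_bit, w_x))
    return stream


if __name__ == "__main__":
    stream = x_string_load(11, [0b10110010, 0b00000101])
    print(stream[:12])
```

With X_count = 11, the first eleven pairs carry W_X = 0 and the remaining thirty-seven carry W_X = 1, regardless of the value on the X line.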
In the case of a Flip X-string load, complications are introduced: the
characters in the X-string word are in reverse order and, if the number of
non-wild-card characters isn't a multiple of 8, there is some initial set-up of the
data to be performed. As an example, let the number of non-wild-card
characters be 11 (X_count = 11). The string words in the FIFO would be
arranged as shown in figure 3-6.

FIFO:   | b3  b4  b5  b6  b7  b8  b9  b10 |   2nd word X-string
        | --- --- --- --- --- b0  b1  b2  |   1st word X-string
        | X_count                         |
Figure 3-6: FIFO word organization - Flip

Clearly, for the second and succeeding words of the X-string, all that needs to
be done is to use a multiplexer to route the right-most bit of the FIFO into
the left-most bit of the shift register, etc. (see figure 3-7). However, for the
first word of the X-string, the shift register contents will have five (in this
example) initial junk bits (see figure 3-8) after going through the multiplexer.
Therefore, for the first word of the X-string in the case of a Flip X-string load,
the shift register must be clocked N times (where N = [8 - (X_count MOD 8)])
unless X_count is a multiple of 8, in which case no initial shifts are done.

Figure 3-7: Flip MUX (FIFO output routed through MUX7 ... MUX0, selected by Flip, into the shift register)

| b2 | b1 | b0 | --- | --- | --- | --- | --- |   Shift Register
Figure 3-8: Shift Register

This scheme of initial shifts is implemented through control signals from the
microcontroller, which looks at the X_count to determine how many initial
shifts need to be performed by the shift register. During these initial shifts, the
X_counter and 48_counter are disabled. After the initial shifting, the
down-counters are enabled and loading proceeds as a normal X-string load with the
microcontroller signalling when the shift register is to be loaded from the FIFO
(still every 8 clocks, but with an offset due to the initial shifts).
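The Flip handling for the first X-string word can be checked with a small sketch. It assumes, for illustration only, that a FIFO word is an 8-bit integer whose right-most bit is bit 0; flip_word and initial_shifts are hypothetical names for the mirroring done by the Flip MUX and for the shift count N = 8 - (X_count MOD 8).

```python
def flip_word(word):
    """Mirror an 8-bit word end for end, as the Flip MUX routing does."""
    return int(f"{word & 0xFF:08b}"[::-1], 2)


def initial_shifts(x_count):
    """Number of initial shifts N for the first word of a Flip X-string load."""
    remainder = x_count % 8
    return 0 if remainder == 0 else 8 - remainder


if __name__ == "__main__":
    # First FIFO word for X_count = 11: five junk bits, then b0 b1 b2,
    # here with b0 = 1, b1 = 1, b2 = 0 (junk bits taken as 0).
    first_word = 0b00000110
    mirrored = flip_word(first_word)          # b2 b1 b0 now left-most
    aligned = mirrored >> initial_shifts(11)  # five shifts discard the junk
    print(f"{mirrored:08b} -> {aligned:08b}")
```

After the five initial shifts, b0 sits in the right-most position of the shift register and is therefore the first character clocked into the matching module.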
When executing an A-string load, which must load two characters per
clock cycle, the problem is further complicated. As before, the normal A-string
load will be discussed first since it is more straightforward than the Flip
A-string load. When performing an A-string load, the FIFO contains data in the
same format as for an X-string load: the first word in the FIFO is A_count
and the remaining words are the A-string characters. Since the A-string is
loaded two characters per clock cycle and there are two lines per character,
there are four A-string inputs to the matching module: A and W_a each for the
upper cells and lower cells.

Figure 3-9: A-String Load (block diagram showing Shift Reg Upper, Shift Reg Lower, MUX 0 to MUX 7, the Delay/Cross and Odd select lines, and the A and W outputs to the Upper Cells and Lower Cells)

Figure 3-9 depicts in block diagram form how the A-string load is
implemented. The upper cells of the matching module are to receive characters
a0, a2, a4 ... and the lower cells a1, a3, a5 ... in that order. Again, as with an
X-string load, the normal (not Flip) A-string load will be discussed first
because it is conceptually clearer.
To simplify discussion, an example with A-string length of 13 will be used
to explain how an A-string load is executed. For a normal A-string load the
contents of the shift register will be as shown in figure 3-10.

Figure 3-10: SR Configuration: (a) 1st word, (b) 2nd word

For an A-string load the total length counter is loaded with 24 (was 48 for an
X-string load) and the A_counter with M (where M = [A_count DIV 2]). This
is because two characters per clock cycle are loaded into the matching module.
Since this is a normal (not Flip) A-string load, the microcontroller sets
Delay/Cross, the select line of a multiplexer, to 0 (this MUX will be further
explained later). Figure 3-10a shows the first word of the A-string in the shift
registers. Everything is now ready for execution and the shift registers,
24_counter, and A_counter are enabled. The microcontroller signals the FIFO
to load the shift registers with a new word every fourth clock pulse, since two
bits are shifted each clock cycle. In the example, after the sixth clock cycle,
the shift registers look like figure 3-11 and the A_counter is 0 (13 DIV 2 =
6).

| --- | --- | --- | a12 |
| --- | --- | --- | --- |
Figure 3-11: SR After 6 Clocks

Notice that on the next clock cycle, the lower cells should be loaded with a
wild-card character while the upper cells should clock in a12. The wild-card
MUX select line should be high (Odd = 1) to accomplish the necessary one clock
delay for W_a to the upper cells (refer to figure 3-9). This delay in wild card
generation is necessary any time the number of non-wild-card characters in the
A-string is odd. The microcontroller tests the LSB of A_count (first word in
the FIFO before load execution) and if it is 1, sets Odd to 1 and consequently
the delayed W_a is selected for the upper cells by the MUX. Things proceed in
much the same manner as for an X-string load: when the A_counter reaches 0
it stops counting and W_a = 1. Execution is complete when the 24_counter
reaches 0.
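The pairing of characters and the one-clock wild-card delay for odd string lengths can be sketched as follows. This is a behavioural model under stated assumptions, not the circuit: a_string_load is a hypothetical name, the A-string is given as a list of characters a0, a1, ..., and a wild card is represented by None.

```python
WILD = None  # hypothetical marker for a wild-card character


def a_string_load(a_chars, total_clocks=24):
    """Return the (upper, lower) characters loaded on each clock.

    Upper cells receive a0, a2, a4, ...; lower cells receive a1, a3, a5, ...
    The A_counter is loaded with M = A_count DIV 2. When A_count is odd
    (Odd = 1), W_a to the upper cells is delayed by one clock so that the
    last character can still be loaded.
    """
    a_count = len(a_chars)
    m = a_count // 2        # value loaded into the A_counter
    odd = a_count % 2       # LSB of A_count
    schedule = []
    for clock in range(total_clocks):
        w_lower = 1 if clock >= m else 0
        w_upper = 1 if clock >= m + odd else 0
        upper = WILD if w_upper else a_chars[2 * clock]
        lower = WILD if w_lower else a_chars[2 * clock + 1]
        schedule.append((upper, lower))
    return schedule


if __name__ == "__main__":
    chars = [f"a{i}" for i in range(13)]
    for clock, pair in enumerate(a_string_load(chars)[:8]):
        print(clock, pair)
```

For the 13-character example, clocks 0 through 5 load the pairs (a0, a1) through (a10, a11); on the next clock the upper cells receive a12 while the lower cells receive a wild card, and all later clocks load wild cards to both.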
In the case where the label is read in reverse order (Flip is true) there is
a further complexity similar to that found in an X-string load. For our
example of 13 non-wild-card characters, figure 3-12 shows the state of the shift
registers for the words.

Figure 3-12: SR Initial, Flip True: (a) 1st Word, (b) 2nd Word

Notice that to prepare a1 to be loaded into the matching module, the upper
shift register must be initially shifted twice. Since the upper and lower shift
registers are clocked simultaneously, the lower shift register will be shifted twice
also and a0 will then reside in the one-bit delay register (D in figure 3-9).
Remember that when doing an A-string load, a0 and a1 are loaded
simultaneously, as are a2 and a3, etc., and that even-numbered components are
to be loaded into the upper cells and the odd-numbered components in the
lower cells. For this reason, the Delay/Cross control line in figure 3-9 is
driven high by the microcontroller when doing an A-string load if both Flip
and Odd are true.
Figure 3-13 demonstrates the loading of our example string from clock
pulse 2 through clock pulse 4. As before, W_a = 1 when the A_counter reaches
0. Also as before, W_a to the upper cells is delayed one clock cycle since the
string length is odd. Loading is complete when the 24_counter reaches 0.

Figure 3-13: SR: Clocks 2 thru 4

3.4 Design Principles
There are three functions that the matching module and the interface
hardware must perform:
1. begin a totally new string;
2. attempt to add to an existing, partially reconstructed string;
3. remove some characters from the end of an existing string.
The instructions for these three functions are, respectively:
1. Reset;
2. Add;
3. Undo.
The microcode to execute these instructions is resident in memory beginning at
locations 04H (Reset), 74H (Add), and FCH (Undo). Examining the binary
representation of each (below),

       b7 b6 b5 b4 b3 b2 b1 b0
04H    0  0  0  0  0  1  0  0
74H    0  1  1  1  0  1  0  0
FCH    1  1  1  1  1  1  0  0

we recognize that the last three bits (b2, b1, b0) don't change and that b3 = b7 and
b4 = b5 = b6. Therefore, only two bits are needed from the host computer to
specify the instruction to be executed if the 8-bit microprogram address is
generated as shown in figure 3-14.

Figure 3-14: Microsequencer (address bits b7 ... b0 into the µ-Sequencer)

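One way to generate the three starting addresses from two instruction bits is sketched below. The exact wiring of figure 3-14 is not recoverable here, so the mapping is an assumption: the two-bit codes 00 (Reset), 01 (Add), and 1x (Undo) are taken from figure 3-15, b7 and b3 are driven by IR1, and b6, b5, b4 by (IR1 OR IR0). The function name start_address is hypothetical.

```python
def start_address(ir1, ir0):
    """Build the 8-bit microprogram starting address from two bits.

    Assumed wiring (one possibility consistent with the table above):
      b7 = b3 = IR1
      b6 = b5 = b4 = IR1 OR IR0
      b2 = 1, b1 = b0 = 0 (fixed)
    """
    hi = ir1 & 1
    mid = (ir1 | ir0) & 1
    return (hi << 7) | (mid << 6) | (mid << 5) | (mid << 4) | (hi << 3) | 0b100


if __name__ == "__main__":
    for name, (ir1, ir0) in {"Reset": (0, 0), "Add": (0, 1), "Undo": (1, 0)}.items():
        print(f"{name}: {start_address(ir1, ir0):02X}H")
```

Under this assumed wiring both Undo codes (10 and 11) produce FCH, which agrees with the "1x" entry for Undo.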
In addition to the two bits of the instruction register specifying what the
general instruction is, there are two other bits in the instruction word: Flip
and Bar/Space. Flip and its function have been discussed previously; it is an
indication of whether or not the laser scanned the label "inside out".
Bar/Space indicates whether the first character of the A-string is the result of
the laser having read a bar (black stripe on the label) or a space between two
bars (both bars and spaces contain information about the label). The matching
module uses the Bar/Space bit to ensure that a 1 (or 0) of the A-string that
corresponds to a bar doesn't get matched up with a 1 (or 0) of the X-string
corresponding to a space between bars and vice versa. A summary of the 4-bit
instruction set as implemented is given in figure 3-15.

Four-bit Instruction Codes:

IR3  IR2  IR1  IR0
X    0    0    0    Totally new string, characters inside out
X    1    0    0    Totally new string, characters in order
0    0    0    1    A-string to be added, inside out, Space first
1    0    0    1    A-string to be added, inside out, Bar first
1    1    0    1    A-string to be added, in order, Bar first
0    1    0    1    A-string to be added, in order, Space first
X    X    1    1    String characters to be removed

Interpretations of Specific Instruction Bits:

Bar/Space = 0 -> First character is a space
Flip = 0 -> Inside-out scan
IR1 IR0:  00  Reset
          01  Add
          1x  Undo

Figure 3-15: Instruction Set

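The instruction codes of figure 3-15 can be decoded in software as a cross-check of the table. The sketch below is illustrative only; decode_instruction and the dictionary keys are hypothetical names, IR3 is taken as Bar/Space and IR2 as Flip (0 meaning an inside-out scan), and any code with IR1 = 1 is treated as Undo, following the "1x" entry.

```python
def decode_instruction(ir):
    """Decode a 4-bit instruction word IR3..IR0 into its fields."""
    ir3 = (ir >> 3) & 1   # Bar/Space: 1 = bar first, 0 = space first
    ir2 = (ir >> 2) & 1   # Flip: 0 = inside-out scan, 1 = in order
    ir1 = (ir >> 1) & 1
    ir0 = ir & 1
    if ir1:
        return {"op": "Undo"}
    if ir0 == 0:
        return {"op": "Reset", "inside_out": ir2 == 0}
    return {
        "op": "Add",
        "inside_out": ir2 == 0,
        "first": "Bar" if ir3 else "Space",
    }


if __name__ == "__main__":
    for code in (0b0000, 0b0100, 0b0001, 0b1101, 0b0011):
        print(f"{code:04b}", decode_instruction(code))
```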
The interface is address mapped by the host computer with very simple
handshaking on the multi-bus of the 8085. The individual addresses are:
1. Input FIFO (data);
2. Instruction register (instruction word);
3. Instruction in FF;
4. Input FIFO Clear;
5. Undo Counter (length data);
6. Output FIFO (data output to host);
7. Current count (length data output to host).
These addresses are decoded and the necessary enabling signals for the
individual devices generated. For 1 through 5, the device is clocked when the
handshake signal MWTC from the multi-bus goes low and the handshake signal
XACK is then generated by the interface. For 6 and 7, the devices are read
when MRDC is driven low by the host. This action does not affect any other
function occurring at the time.
3.5 Collecting Results
Valid data begins to propagate from the matching module circuit 24 clock
cycles after an A-string load has been completed. This data is the newly
reconstructed X and W_X, on two separate lines. As this data is generated, a
counter is enabled as long as W_X = 0, indicating that the string character at that
time is not a wild card. At the completion of 48 clock cycles, the counter
value will be the number of non-wild-card characters in the newly reconstructed
string. The X data is shifted into a serial-input parallel-output shift register.
The contents of this shift register are
clocked into an output FIFO every eight clock cycles on command from the
interface controller. The FIFO is then read upon demand by the host
computer.
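The result collection described above amounts to counting the clocks on which W_X = 0 and packing the X line into 8-bit words. The sketch below models this in software; collect_results is a hypothetical name, and the bit order within each output word (first character received in bit 0) is an assumption, since the text does not state it.

```python
def collect_results(stream):
    """Model the output side of the interface.

    stream -- list of (X, W_X) pairs, one per clock (48 in the project)
    Returns (count, words): the number of non-wild-card characters and
    the list of 8-bit words clocked into the output FIFO.
    """
    # The counter is enabled only while W_X = 0.
    count = sum(1 for _, w_x in stream if w_x == 0)

    words = []
    for start in range(0, len(stream), 8):
        word = 0
        for offset, (x_bit, _) in enumerate(stream[start:start + 8]):
            word |= (x_bit & 1) << offset   # assumed bit order
        words.append(word)                  # one FIFO write per eight clocks
    return count, words


if __name__ == "__main__":
    demo = [(1, 0)] * 3 + [(0, 1)] * 45
    print(collect_results(demo))
```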
3.6 Control
All control signals in the project under discussion (both for the interface
and the matching module itself) are generated by a microcontroller. This unit
consists of a microsequencer (AM2911), PROMs, and multiplexers. This
approach to the design of the control unit was selected because it has the
advantage of good speed (faster than a microprocessor), flexibility (easy to
expand the number of control signals or change them by simply adding or
re-burning EPROMs), and ease of design.
3.6.1 Microcontroller Hardware 
The AM2911 microsequencer block diagram is shown in figure 3-16. The
control logic in the Accusort project is rather simple, so neither the 2911's
Stack nor its Register are used (shaded blocks in figure 3-16). The AM2911 is
used simply to generate the next µPC address by incrementing the current
address (µPC ← µPC + 1) and selecting either this next instruction address or
the address on the D/R lines (a JUMP) depending on whether S1S0 are 00 or
11. In this project, the JUMP address is either generated within the
microprogram (address has been burned into the PROM) or is external (namely
the starting address of our 3-instruction set). Additionally, conditions must be
tested to determine whether or not a JUMP should be performed. Figure 3-17
illustrates the microsequencer with the additional support logic.

Figure 3-16: µ-Sequencer Block Diagram

In figure 3-17, the three bits from the PROM labeled Branch Select
determine which condition should be tested. For example, if Branch Select is
111 then logic 1 is selected from the test condition MUX and an unconditional
JUMP will be executed. The destination address will either be an instruction
address if Address Select is 1 or will be the address in the jump-address-field of
the PROM word if Address Select is 0. If Branch Select is 000 then logic 0 is
selected by the test condition MUX and no jump will be executed; the next
address will be the current address plus one. If Branch Select is 001 through
110 then a JUMP will occur if that condition selected for test is true (logic 1)
at the time of the next clock.
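The next-address selection governed by Branch Select and Address Select can be summarized in a short behavioural sketch. It is not a model of the AM2911 itself; next_address and its parameter names are hypothetical, the test conditions are passed as a dictionary keyed by the Branch Select value (1 through 6), and an 8-bit address is assumed.

```python
def next_address(upc, branch_select, address_select,
                 conditions, jump_field, instruction_address):
    """Return the next microprogram address.

    upc                 -- current microprogram address
    branch_select       -- 3-bit field from the PROM word
    address_select      -- 1: jump to the instruction address,
                           0: jump to the PROM jump-address field
    conditions          -- dict mapping Branch Select 1..6 to a boolean
    jump_field          -- jump address burned into the PROM word
    instruction_address -- externally generated starting address
    """
    if branch_select == 0b000:
        take_jump = False                 # logic 0 selected: never jump
    elif branch_select == 0b111:
        take_jump = True                  # logic 1 selected: always jump
    else:
        take_jump = bool(conditions.get(branch_select, False))

    if not take_jump:
        return (upc + 1) & 0xFF           # current address plus one
    return instruction_address if address_select else jump_field


if __name__ == "__main__":
    # Conditional jump on test condition 1, not yet true: fall through.
    print(next_address(0x00, 0b001, 1, {1: False}, 0x00, 0x74))
    # Same word with the condition true: jump to the instruction address.
    print(next_address(0x00, 0b001, 1, {1: True}, 0x00, 0x74))
```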

Figure 3-17: µController Block Diagram

3.6.2 Microsequencer Software
Development of the code for the microsequencer will be discussed in the
following order: waiting for an instruction; X-string load; A-string load; Undo.
Initial address at power-up and after completion of any instruction is 00H. At
location 00H, the Instruction_in FF bit (001 Branch Select) is tested and, if it
has been set (indicating that there is an instruction to be executed), a JUMP
to the instruction address (either 04H, 74H, or FCH) is taken. Otherwise, the
next address is 01H, which simply jumps unconditionally back to 00H. Until
the microcontroller is informed that an instruction is to be executed, the µPC
waits in the 00H to 01H loop.
The X-string load instruction routine begins at 04H. Here two counters
are loaded: Scratch with the 3 LSBs of X_count (remember that X_count is
the number of non-wild-card characters in the X-string) and 6_count with the
value six.
The 6_count counter is responsible for counting the number of times
that an X-string load software loop has been executed (each time through the
loop loads one 8-character data word into a shift register and clocks the FIFO
for the next word). When the 6_count has reached 0, all 48 characters of
the X-string have been loaded and the X-string load instruction is completed.
The program will then jump back to 00H to await the next instruction.
The Scratch counter is used when a string has been read "inside-out"
(described earlier in this chapter) and the number of non-wild-card characters is
not a multiple of 8. As previously mentioned, the first word of characters must
be initially shifted [8 - (X_count MOD 8)] times for this case. Since Scratch is
loaded with the 3 LSBs of X_count (this equals X_count MOD 8), the first
data word is shifted and the Scratch counter incremented until Scratch
becomes eight. The first word has then been prepared and X-string loading can
commence.
The program will now jump to a load routine related to the three LSBs
of X_count. Different routines are necessary because the second data word
must be fetched from the FIFO at different times for different X-string lengths
(all this applies only when Flip is true). For example, if X_count is 37, then
there are five characters in the first data word (37 MOD 8 = 5) and, after
X-string loading has commenced, the second data word must be fetched after five
clock cycles. On the other hand, if X_count is 19 then the second data word
must be fetched after three clock cycles (19 MOD 8 = 3). In all cases, after
the first word, future data words must be fetched every eight clock cycles.
An example of the code for an X-string load (for X_count MOD 8 = 5)
is shown in figure 3-18.

Begin:  Enable X_count, X-shift
        Enable X_count, X-shift
        Enable X_count, X-shift
        Enable X_count, X-shift
        Enable X_count, X-shift; fetch next word
        Enable X_count, X-shift
        Enable X_count, X-shift
        Enable X_count, X-shift, 6_count; jump to Begin if TC6 ≠ 0
        Routine done; jump to 00H for next instruction
Figure 3-18: X-string load routine

Notice that the loop that is repeated is eight lines long. This means that a
new data word will be fetched on the 5th, 13th, 21st, 29th, etc. clock cycles. In
other words, after the initial word is loaded, each succeeding word is fetched
every eight clock cycles until all words have been fetched. Figure 3-19 shows
the flow chart for an X-string load.
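The fetch timing can be tabulated with a short sketch. The function name fetch_cycles is hypothetical, clock cycles are numbered from 1 as in the text, and the treatment of an X_count that is a multiple of 8 (first fetch on the 8th cycle) is an assumption, since the text only discusses the other cases.

```python
def fetch_cycles(x_count, words_to_fetch):
    """Clock cycles (numbered from 1) on which the next FIFO word is fetched.

    The first fetch follows the characters remaining in the first data
    word (X_count MOD 8); later fetches follow every eight clock cycles.
    """
    first = x_count % 8 or 8   # assumption: a full first word lasts 8 clocks
    return [first + 8 * k for k in range(words_to_fetch)]


if __name__ == "__main__":
    print(fetch_cycles(37, 4))   # 5th, 13th, 21st, 29th clock cycles
    print(fetch_cycles(19, 3))   # 3rd, 11th, 19th clock cycles
```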
The software for an A-string load is structured in much the same manner
as an X-string load. There are some differences, however. For instance, since
the A-string is loaded two characters per clock cycle, a new data word must be
fetched from the FIFO every fourth clock cycle rather than every eighth as with
an X-string load. Just as with an X-string load, no initial shifting of the first
word of the A-string is necessary if the label was not scanned inside out or if
the length of the string is a multiple of eight.

Figure 3-19: X-string load Flowchart

If Flip is true (label scanned inside out) and the string length (A_count)
is not a multiple of eight then the first word of the A-string in the shift
register must be shifted initially (refer to figure 3-12). The number of initial
shifts to be performed is P, where P = [4 - (A_count MOD 8) DIV 2]. To
perform this, the Scratch counter is loaded with bits 2 and 1 of A_count (this
equals (A_count MOD 8) DIV 2, that is, 4 - P). When the Scratch counter reaches
four, the initial shifting is done and the A-string load can proceed.
For an A-string load, the 6_count is again loaded with the value six
and the software loops are each four lines long. Since two characters are
loaded per line, six times through the loop will load all 48 characters. Figure
3-20 gives the code for loading an A-string where P is 2. The flowchart for an
A-string load is shown in figure 3-21.

Begin:  Enable A_count, A-shift
        Enable A_count, A-shift; fetch next word
        Enable A_count, A-shift
        Enable A_count, A-shift, 6_count; jump to Begin if TC6 ≠ 0
        Jump to 00H for next instruction
Figure 3-20: A-string load routine

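The initial shift count for a Flip A-string load and the corresponding Scratch preload can be checked numerically. The sketch below is an illustration under the stated formula; a_initial_shifts and scratch_preload are hypothetical names, and the Scratch counter is assumed to count up from its preload until it reaches four.

```python
def a_initial_shifts(a_count, flip):
    """Initial shifts P for the first A-string word.

    No shifting is needed for a normal load or when A_count is a
    multiple of eight; otherwise P = 4 - (A_count MOD 8) DIV 2.
    """
    if not flip or a_count % 8 == 0:
        return 0
    return 4 - (a_count % 8) // 2


def scratch_preload(a_count):
    """Bits 2 and 1 of A_count, i.e. (A_count MOD 8) DIV 2."""
    return (a_count >> 1) & 0b11


if __name__ == "__main__":
    a_count = 13
    p = a_initial_shifts(a_count, flip=True)
    print(p, scratch_preload(a_count))
    # Counting up from the preload to four takes exactly P steps.
    print(4 - scratch_preload(a_count) == p)
```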
An Undo instruction is executed when the host computer recognizes that
some part of the reconstructed string does not belong where placed. It then
requests that all characters in the reconstructed string after a certain number
(Undo_count) be converted to wild-cards, effectively erasing these characters.
The implementation is rather simple: the host computer memory-writes the
Undo_count to a down counter and, when this counter reaches 0, the
remaining (48 - Undo_count) characters are made wild-cards. When finished,
the program again returns to 00H to wait for the next instruction. A flowchart
for an Undo instruction is given in figure 3-22.
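The effect of an Undo instruction on the reconstructed string can be expressed in a few lines. This is a software illustration of the outcome only, not of the counter hardware; the function name undo and the use of "W" for a wild card are assumptions.

```python
def undo(chars, undo_count, wild="W"):
    """Keep the first undo_count characters; make the rest wild cards.

    chars is the reconstructed string as a list of '0', '1', or 'W'
    (48 entries in the Accusort project).
    """
    kept = chars[:undo_count]
    return kept + [wild] * (len(chars) - len(kept))


if __name__ == "__main__":
    string = list("0110") + ["W"] * 44
    result = undo(string, 2)
    print("".join(result[:6]))
```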

Figure 3-21: A-string load Flowchart

Figure 3-22: Undo Flowchart

The Accusort project interface can be thought of as a specific example of
the general interface design presented in the previous chapter. Because the
range of functions to be performed by the Accusort interface is limited, much of
the flexibility of the general interface (and the associated circuitry) is not
implemented. However, there are definite parallels between the two designs.
Obviously, the use of FIFOs to queue data and to bridge the speed gap
between the array and the host is the same in both interfaces. Also, both the
X-string load and the A-string load utilize parallel-load serial-output shift
registers. Because available off-the-shelf parts had to be used, conventional 8-bit
shift registers were used in lieu of the Parallel-to-Serial Converter of the general
interface design. Nonetheless, the function performed by these discrete
components strongly matches that of the general interface's Parallel-to-Serial
Converter. The Switch Network in the general design is directly analogous to
the multiplexer of figure 3-9 that routes A to the upper and lower cells of the
matching module circuit.
There is no Permutation Network in the Accusort interface since output
generated by the matching module circuit is of one format only. Output
generated bit-serially by the matching module circuit is routed to an 8-bit
serial-input parallel-output shift register (commercially available). This is a very
basic Output Stager circuit consisting of one Output Stager Module, since N = 1
is the number of "rows" (as defined in the general interface) of the array
simultaneously generating output. The output of this shift register is clocked
directly into the output FIFO.
Chapter 4 
Conclusion 
4.1 Summary of Important Results
Having demonstrated the need for a high-speed general purpose interface to
connect a host computer to a bit-sequential Systolic Array, this thesis has
attempted to highlight important considerations in designing such a circuit.
Many aspects of such a design pose significant hurdles to an efficient interface.
These aspects, and solutions which would yield an efficient design, were
examined in depth and are summarized in the paragraphs below.
The interface must accommodate differences in data formats and operating
speeds between the host and the array. This implies that the interface have its
own memory, to buffer bulk data transfers; its own data staging circuitry, to
satisfy array I/O requirements; and its own controller, to intelligently interpret
instructions from the host computer and configure itself accordingly.
Great effort was concentrated in the area of array I/O, especially in the
efficient collection of array output. The data collection and routing strategy
presented in section 2.6.4 is optimal and is applicable to any situation in which
independent processes time-share a single resource. As such, its potential
application reaches far beyond use solely in a host-to-array interface.
The mismatch in bandwidths of the host, the array, and the interface
memory requires a solution that is multi-faceted. Because the host and the
array do not operate synchronously, FIFOs are employed to ensure that there
are no memory access conflicts. The proper selection of RAM memory word
size (section 2.7) is fundamental to the most efficient data transfer between host
and array.
Bit-slice microcontroller technology proved an effective solution to the
problem of interface control logic. This approach provides the flexibility, speed,
and intelligence to perform the necessary control functions, as demonstrated in
section 3.6.
Although time constraints prohibit our investigating all possible aspects of 
the design, the major points of a general interface have been addressed in this 
work. Areas which merit further consideration are discussed in the next section. 
4.2 Future Work
The effect of changing array word size (NA) or host word size (NH) on the
Conditioner & Memory circuits deserves examination. A design that is optimal
for one pair of NA and NH may be a very poor choice when either word size
changes. A possible solution to this problem may be the use of programmable
Conditioner & Memory subsystems that can be reconfigured for a variety of host
and array word sizes.
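
As a rough software analogy of what a reconfigurable subsystem would have to do, the sketch below regroups a stream of words of one width into words of another width, with both widths passed as parameters. It is an illustration only and does not represent the Conditioner & Memory design of this thesis; the function name, the most-significant-bit-first ordering, and the zero padding of a trailing partial word are all assumptions.

```python
def repack(words, in_width, out_width):
    """Regroup a stream of in_width-bit words into out_width-bit words.

    Assumptions (not taken from the thesis): bits are handled most
    significant first, and a trailing partial word is zero-padded.
    """
    in_mask = (1 << in_width) - 1
    out_mask = (1 << out_width) - 1
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently held in acc
    out = []
    for w in words:
        acc = (acc << in_width) | (w & in_mask)
        nbits += in_width
        while nbits >= out_width:
            nbits -= out_width
            out.append((acc >> nbits) & out_mask)
        acc &= (1 << nbits) - 1  # keep only the bits not yet emitted
    if nbits:
        out.append((acc << (out_width - nbits)) & out_mask)
    return out
```

For example, `repack([0xAB, 0xCD], 8, 4)` splits two 8-bit words into four 4-bit words, and the same routine with the widths exchanged reassembles them. In hardware the equivalent flexibility would have to come from programmable staging logic rather than from a loop.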
Additional future work should be focused on design of the controller, as
this unit is the nervous system of the interface. As mentioned earlier, bit-slice
microcontrollers have desirable properties and, therefore, deserve further study.
Effort should also be concentrated on the design of algorithms and instruction
sets for the microsequencers.
In considering data formats for the array I/O, only word sizes that are a
power of 2 were allowed. These word sizes permit a wide range of
computations to be performed and, so, are not terribly restrictive. However,
methods for generating and collecting array words of any size could be
investigated.
Additional attention should be paid to universal "bit shuffling" within the
array words. The Input Stager described in Chapter 2 is capable of performing
row-to-column permutations. Others have studied methods of bit shuffling [9,10]
and incorporating their techniques into the design of the Input Stager would
make it, and consequently the general interface, more universal and powerful.
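
To make the two kinds of permutation concrete, the sketch below models a row-to-column permutation on a block of words and, as one example of a more general bit shuffle, a perfect shuffle within a single word. Both functions are hypothetical illustrations written for this summary; they are not the Input Stager circuit, and they are not taken from [9] or [10].

```python
def row_to_column(words, width):
    """Row-to-column permutation: output word j collects bit j of every
    input word, starting with the most significant bit position."""
    columns = []
    for j in range(width - 1, -1, -1):
        col = 0
        for w in words:
            col = (col << 1) | ((w >> j) & 1)
        columns.append(col)
    return columns


def perfect_shuffle(word, width):
    """Interleave the upper and lower halves of a width-bit word.

    `width` is assumed to be even, which holds for the power-of-2 word
    sizes considered in this work."""
    half = width // 2
    upper = word >> half
    lower = word & ((1 << half) - 1)
    out = 0
    for i in range(half - 1, -1, -1):
        out = (out << 1) | ((upper >> i) & 1)
        out = (out << 1) | ((lower >> i) & 1)
    return out
```

Applying `row_to_column` twice to a square block (as many words as bits per word) returns the original block, which is a convenient check on any implementation of the permutation.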
References 
[1] Kung, H.T. and Leiserson, C.E., "Systolic arrays (for VLSI)", Sparse
Matrix Symposium, 1978, Duff, I.S. and Stewart, G.W., ed., SIAM,
pp. 256-282, 1978.
[2] Davis, R.H. and Thomas, D., "Systolic array chip matches the pace
of high-speed processing", Electronic Design, pp. 207-218, Oct. 1984.
[3] Kung, S.Y., "On supercomputing with systolic/wavefront array
processors", Proc. IEEE, vol. 72, pp. 867-884, July 1984.
[4] Kung, S.Y., "VLSI array processors", IEEE ASSP Magazine, vol. 2,
pp. 4-22, July 1985.
[5] Batcher, K.E., "Design of a massively parallel processor", IEEE
Trans. Comput., vol. c-29, pp. 836-840, Sept. 1980.
[6] Batcher, K.E., "Staging memory", The Massively Parallel Processor,
Potter, J.L., ed., MIT Press, pp. 191-204, 1985.
[7] Hannaway, W.H., Shea, G., and Bishop, W.R., "Handling real-time
images comes naturally to systolic array chip", Electronic Design, pp.
289-300, Nov. 1984.
[8] Tewari, N. and Wagh, M., "Bit-sequential array for pattern
matching", Proc. IEEE, vol. 74, No. 10, pp. 1465-1466, Oct. 1986.
[9] Bauer, L.H., "Implementation of data manipulating functions on the
STARAN associative processor", Parallel Processing: Proceedings of
the Sagamore Computer Conference, August 20-23, 1974, Feng, T.,
ed., Springer-Verlag Berlin - Heidelberg, pp. 209-227, 1975.
[10] Feng, T., "Data manipulating functions in parallel processors and
their implementations", IEEE Trans. Comput., vol. c-23, pp. 309-318,
Mar. 1974.
Vita 
The author was born on January 25, 1959 in Bethlehem, Pa. to Mr. and
Mrs. William C. Seaman. After graduating from Bethlehem Catholic High
School he attended Lehigh University and earned the Bachelor of Science degree
in Electrical Engineering, graduating with honors in May, 1981. He then was
employed by Westinghouse Electric Corp. for four years as a design engineer
before returning to Lehigh to pursue a Master of Science degree in Electrical
Engineering.
