Two-level pipelined systolic array graphics engine by Jayasinghe, J.A.K.S. et al.
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 26, NO. 3, MARCH 1991 229 
Two-Level Pipelined Systolic Array 
Graphics Engine 
J. A. K. S. Jayasinghe, Member, IEEE, F. Moelaert El-Hadidy, G. Karagiannis, 
Otto E. Herrmann, Member, IEEE, and J. Smit 
Abstract - Simultaneous improvement of interaction speed and image
quality of raster graphics systems is an important topic in computer 
graphics research. Due to huge processing power requirements, realistic 
shading methods (like Phong shading) have been used only in noninter- 
active applications whereas less realistic shading methods (like Con- 
stant shading) have been used in current interactive applications. A 
design of a systolic array graphics engine is described which generates 
high-quality Phong-shaded images in real time. Under the area and 
speed limitations of the current IC technologies, the required speed is 
achieved using pipelined functional units. A prototype containing nine 
processing elements was fabricated in a 1.6-μm CMOS technology.
I. INTRODUCTION 
IN THE past decade, raster graphics systems have become
more popular than vector graphics systems because of
their better image quality. The frame buffer in the raster
graphics systems has been identified as the major bottleneck
for real-time interaction [1] due to its insufficient bandwidth.
A VLSI systolic array graphics (SAG) engine called Super 
Buffer (only capable of Constant shading) was first intro- 
duced in 1985 [2] to replace the frame buffer by a processor 
array. More powerful SAG engines capable of Gouraud 
shading were introduced later [3], [4] which use 16-b fixed- 
point arithmetic. The main advantage of the SAG engine is 
its smaller overall system size compared to other graphics 
systems [2], and its potential for better interaction speed [5]. 
As the maximum operating speed of a systolic array is 
determined by the delay of the most complex operation, the 
maximum operating speed of an SAG engine tends to reduce 
as more complex functions are introduced to improve the 
image quality. Attempts to improve the speed using faster 
functional units (like carry-lookahead adders instead of simple
carry-ripple adders) lead to larger silicon area. Since a
large number of processing elements (PE's) is needed in an 
SAG engine, this solution is impractical. The purpose of this 
paper is to report a VLSI design of an advanced SAG engine
built from pipelined functional units which can generate
realistic images interactively for high-resolution displays.

Manuscript received July 27, 1990; revised November 1, 1990. This
work is part of a project for developing a graphics workstation. The
project is supported by the Dutch Research Foundation (STW) under
Contract CW177.1249, carried out by the University of Twente,
Enschede, The Netherlands, and the Center for Mathematics and
Computer Science, Amsterdam, The Netherlands.
J. A. K. S. Jayasinghe is with the Laboratory for Network Theory,
University of Twente, 7500 AE Enschede, The Netherlands, on leave
from the Department of Electronics Engineering, University of
Moratuwa, Moratuwa, Sri Lanka.
F. M. El-Hadidy, G. Karagiannis, O. E. Herrmann, and J. Smit are
with the Laboratory for Network Theory, University of Twente, 7500 AE
Enschede, The Netherlands.
IEEE Log Number 9041479.

This paper is organized as follows. In the next section, we 
introduce a structured frame store system as an environment 
for the advanced SAG engine. In Section III, we present the
principles and architecture of the advanced SAG engine. We 
introduce pipelined functional units into this SAG engine to 
meet the performance requirements. This is done by the 
formal approach presented in Section IV. Next, two architec- 
tures built from pipelined functional units are described in 
Section V. In Section VI, some details of a prototype are 
presented. Finally, some conclusions are drawn. 
II. A STRUCTURED FRAME STORE SYSTEM
For the sake of completeness, a brief description of the 
structured frame store system is presented that incorporates 
the advanced SAG engine. The low-level display file (LDF) 
in Fig. 1 contains nonoverlapping facets (patterns) describing 
the visible image. These patterns are shaded in real time 
while being displayed on the screen. Since a large number of 
small patterns can occur in a typical scene, high bandwidth is 
required to access the LDF. This has been achieved by 
parallel accesses of the LDF by multiple pattern loaders 
(PL's). The patterns that are contributing to the next pixel 
row are stored in the active pattern store (APS). There are
two sets of systolic arrays that work independently. One 
takes care of the incremental calculations along the pixel-col- 
umn direction, the systolic array preprocessing (SAP) engine, 
and the other along the pixel-row direction, the SAG engine. 
The SAP engine calculates the intersections of edges with 
the current pixel row, the color values at the leftmost edge, 
incremental color values along the pixel-row direction, etc. 
These data are sent into the scan-line command buffer 
(SCB) as instructions and data for the SAG engine. In the 
SAG engine, shading is done by incremental calculations. 
The output of the SAG engine is directly used to refresh the 
display. 
This architecture offers attractive features such as expand- 
ability, linear performance improvement with increased 
hardware, efficient implementation in VLSI, and the ability 
to generate high-quality pictures with a good interactive 
behavior. 
III. THE ADVANCED SAG ENGINE
In this section, we introduce an advanced instruction set 
together with the corresponding architecture of an SAG 
engine. Before that, some notes on shading are made to
highlight the power of the new instruction set and
architecture.

0018-9200/91/0300-0229$01.00 © 1991 IEEE

TABLE I
OUR ADVANCED INSTRUCTION SET

Instruction                  Description

REF( )                       Send the local pixel storage to the display
                             and reset the processor.

EVAL0(X, DX, I)              Interpolate and accumulate the intensity
EVAL1(X, DX, I, DI)          between pixel locations X + 1 and
EVAL2(X, DX, DDI, DI, I)     X + DX + 1 until the next REF( ). Zero-,
                             first-, and second-order interpolation are
                             done by the EVAL0(...), EVAL1(...), and
                             EVAL2(...) instructions, respectively.

SETPI(X, DX, I)              Correct the periodic discontinuities during
SETPDI(X, DX, DI)            the next interpolation at pixel locations
SETPDDI(X, DX, DDI)          DX apart, starting from X + 1. The
                             intensity, its first derivative, and its
                             second derivative are corrected by the
                             SETPI(...), SETPDI(...), and SETPDDI(...)
                             instructions, respectively.

SETI(X, I)                   Correct the discontinuity during the next
SETDI(X, DI)                 interpolation at pixel location X + 1. The
SETDDI(X, DDI)               intensity, its first derivative, and its
                             second derivative are corrected by the
                             SETI(...), SETDI(...), and SETDDI(...)
                             instructions, respectively.

DIS(X, DX)                   Disable the accumulation between pixel
                             locations X + 1 and X + DX + 1.

ACC-M( )                     Toggle the accumulation of negative
                             intensities.

Fig. 1. Architecture of the structured frame store system. (Input: from a
higher image generation level. Legend: APS: Active Pattern Store; LDF:
Low-Level Display File; PE-a: Processor Element of SAP; PE-b: Processor
Element of SAG; PL: Pattern Loader; SAG: Systolic Array Graphics (Engine);
SAP: Systolic Array Preprocessing (Engine); SCB: Scan-Line Command Buffer.)

A. Some Notes on Shading 
Shading enhances the realism of computer-generated images.
There are several well-known shading techniques such
as Constant shading, Gouraud shading, and Phong shading, 
mentioned in order of increasing complexity [6]. Constant 
shading calculates a single intensity value for an entire facet. 
Gouraud shading linearly interpolates the intensity along the 
edges and between the edges. Phong shading interpolates 
some vectors and calculates a unit vector dot product raised 
to a power. The image quality that Phong shading offers is 
superior to Gouraud shading but greatly increases the cost of 
computation. We conclude that shading methods providing 
realistic images need huge processing power. In order to 
achieve the required interaction speed, less realistic shading 
methods (like Constant shading) have been used in current 
interactive applications, whereas more realistic shading 
methods (like Phong shading) have been used only in nonin- 
teractive applications. 
We have found that Phong shading can be approximated 
by second-order intensity interpolation with changes to the 
second derivative of the intensity at some strategic points 
during interpolation [5]. This approach dramatically reduces 
the computational power requirements. It introduces no 
visible degradation in image quality, because it calculates the 
intensity values to the same accuracy as the conventional 
Phong shading produces. A 10 × 10-pixel facet can be shaded
using a factor of 10 less processing power, whereas for a
100 × 100-pixel facet the saving is a factor of 25. Though
Phong shading was considered to be computationally too 
expensive for real-time applications, the simplicity of the 
second-order interpolation scheme makes real-time Phong 
shading not only feasible but also faster than Gouraud shad- 
ing. When Phong shading is used, one can increase the sizes 
of facets without losing the quality of the image. This re- 
duces the number of facets needed to describe an image. 
The fewer the number of facets, the smaller the amount of 
processing power required. Our estimate shows that by mak- 
ing the facets more than a factor of 4 larger, we can make 
Phong shading faster than Gouraud shading. 
B. The Instruction Set 
A generalized interpolation scheme, which performs zero-, 
first-, or second-order interpolation with discontinuities in 
intensity and/or derivatives of the intensity, provides a uni- 
fied approach to support several shading methods. Table I 
shows our instruction set for this approach which can be 
executed on an SAG engine. The EVAL*(X, DX, ...) instructions
perform the intensity interpolation between pixel
locations X + 1 and X + DX + 1, and the SET*(...) instructions
are used to set or compensate for the discontinuities
at the required locations. The REF( ) instruction is sent
into the SAG engine in synchronism with the display refreshing.
It sends the pixel value of the current PE it resides in to
the display and resets each PE as it travels through the
processor array. The instructions required to shade the current
pixel row are sent into the SAG engine between the
REF( ) instructions due for the previous and current pixel
rows. When fewer objects (than the capacity of the SAG
engine) have to be displayed, the empty slots between the
REF( ) instructions are filled by NOP( ) instructions. The
DIS(...) instruction disables the accumulation of intensities.
Facets having holes can be shaded efficiently by sending
this instruction before an EVAL*(...) instruction. The
ACC-M( ) instruction toggles the accumulation of negative
intensities. Using this instruction, one can eliminate the
back-facing parts of a facet. Fig. 2 shows the implementation of
some shading algorithms using our instruction set.

Fig. 2. Implementing some shading algorithms using our instruction set.
(Columns: shading technique, instruction, description. Each panel shows an
intensity profile over pixel locations X0 ... X6, spaced DX apart.)
  First panel (title not recoverable):
    SETDDI(X4-1, DDI2)       Second-order derivative correction at X4
    EVAL2(X1-1, 4DX, ...)    Second-order interpolation from X1 to X1 + 4DX
  Shading Facets Having Holes:
    DIS(X2-1, DX)            Disable the accumulation from X2 to X2 + DX
    EVAL1(X1-1, 4DX, I, DI)  First-order interpolation from X1 to X1 + 4DX
  Gouraud Shading + Anti-Aliasing:
    SETDI(X2-1, DI1)         First-order derivative correction at X2
    SETDI(X4-1, DI2)         First-order derivative correction at X4
    EVAL1(X1-1, 4DX, I, DI)  First-order interpolation from X1 to X1 + 4DX
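
As a rough behavioral illustration of how an EVAL*(...) instruction and a
preceding DIS(...) instruction interact on one pixel row, consider the sketch
below. It is not the hardware behavior as specified by the authors: the
inclusive span X + 1 ... X + DX + 1, the assumption that interpolation keeps
stepping inside a disabled span, and the names `span`, `eval_row`, and
`disabled` are all assumptions made for illustration.

```python
def span(x, dx):
    """Pixel span (first, last) addressed by (X, DX); inclusive by assumption."""
    return (x + 1, x + dx + 1)


def eval_row(width, x, dx, i, di=0, ddi=0, disabled=()):
    """Accumulate an interpolated intensity into a fresh pixel row.

    disabled is a list of spans produced by span(), modelling DIS(X, DX)
    instructions sent before this evaluation.
    """
    row = [0] * width
    first, last = span(x, dx)
    for px in range(first, min(last, width - 1) + 1):
        if not any(lo <= px <= hi for lo, hi in disabled):
            row[px] += i      # accumulate only where not disabled
        i += di               # interpolation continues through the hole
        di += ddi
    return row


# First-order interpolation over pixels 1..6 with a hole at pixels 3..4.
hole = span(2, 1)
row = eval_row(10, 0, 5, 1, di=1, disabled=[hole])
```

In this model a facet with a hole needs one extra instruction rather than
being split into two separately interpolated pieces.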
C. The Architecture 
In an SAG engine, the pixel storage, which is limited to a 
pixel row, is distributed over the one-dimensional systolic 
array built from identical PE's. Processing and storage of 
pixels in each pixel column is done sequentially by a single 
PE, such that adjacent pixel columns are taken care of by 
adjacent PE's, minimizing communication requirements. 
During each processor cycle of our SAG engine, the PE
containing the REF( ) instruction sends the pixel value in
register P (see Fig. 3) to the display. As the REF( ) instruction
is passed in synchronism with the display refreshing, the
video stream is generated on the fly and the display resolution
determines the clock frequency. Other PE's perform
operations according to their current instructions. The controller
decodes the instruction to enable proper functional
units in the PE. At the end of the processor cycle all but the
last PE pass instructions to their neighbors and the first PE
receives a new instruction. Before the instructions and data
are sent to the neighboring PE's, they are modified according
to a known criterion. The registers A, B, and C store the
intensity I, its first derivative DI, and its second derivative
DDI. Data are represented by 36-b fixed-point numbers (see
footnote 1) for interpolations and discontinuity corrections.
The processor addresses, X and DX, are represented by 12-b
integers. The processor addresses (i.e., X and DX), DDI, DI,
and I are sent into the array in consecutive time slots in the
given order. For fault tolerance reasons, the processor location X
is identified by decrementing the address X at each PE and
detecting whether its value is zero or not. Processor locations
X + DX, X + 2DX, X + 3DX, ... can also be similarly identified,
by substituting X by DX whenever X is zero. Faulty
PE's are bypassed by disabling the decrementing of X and
bypassing the instructions and data. As the data associated
with instructions are sent in different time slots, the processor
address decrementation, intensity interpolation, and intensity
accumulation can be done on the same adder. The
condition X = 0 can be detected by monitoring the carry-out
of the adder at the 12th bit (see Fig. 3) when X is
represented by the least significant 12 bits. The leftmost
multiplexer provides the data for decrementing X. The
rightmost multiplexers select the correct output data and op
codes depending on the decisions related to the processor
addresses.

Footnote 1: At least (m + 2n)-bit fixed-point representation is needed to
prevent visible quantization errors, where 2^m and 2^n are intensity and
horizontal display resolutions.

Fig. 3. Architecture of a PE in the advanced SAG engine. (Legend: A, B, C:
Registers; BUF: Buffer; Din: Data input; Dout: Data output; Iin: Instruction
input; Iout: Instruction output; MUX: Multiplexer; OD: Output Driver;
P: Pixel Storage; Vin: Video input; Vout: Video output.)
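
The address scheme of this subsection (decrement X at every PE, act when it
reaches zero, reload it with DX) can be sketched in a few lines. This is an
illustrative model only: the zero-based PE numbering, the order of the zero
test and the decrement, and the names `addressed_pes` and `faulty` are
assumptions, not details taken from the paper.

```python
def addressed_pes(x, dx, num_pes, faulty=frozenset()):
    """Return the physical PE indices at which the travelling address hits zero.

    x and dx model the X and DX fields of an instruction entering PE 0.
    PEs listed in faulty are bypassed: they neither test nor decrement X.
    """
    if dx < 1:
        raise ValueError("dx must be at least 1 in this sketch")
    selected = []
    for pe in range(num_pes):
        if pe in faulty:
            continue              # bypass: instruction and data pass unchanged
        if x == 0:
            selected.append(pe)   # this PE is one of X, X+DX, X+2DX, ...
            x = dx                # substitute X by DX
        x -= 1                    # decrement before passing to the neighbor
    return selected


periodic = addressed_pes(3, 4, 16)
with_fault = addressed_pes(2, 2, 8, faulty={1})
```

Because a bypassed PE does not decrement X, the remaining PE's are renumbered
implicitly, which is the fault-tolerance property the text refers to.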
IV. A FORMAL APPROACH FOR TWO-LEVEL PIPELINING 
Pipelined functional units are shown to be very attractive 
in achieving the necessary performance requirements under 
the speed and area limitations of the current IC technolo- 
gies. Systolic arrays built from pipelined functional units 
have been referred to as two-level pipelined systolic arrays in 
the literature [7]. When the behavior of a systolic array is 
time invariant (i.e., each PE performs the same function in 
every cycle) and shift invariant (i.e., the output data of a PE 
are independent of the PE location in the array), it can be 
converted into a two-level pipelined design by using the 
formal methodology presented in [7]. Due to the time-vari- 
ance and shift-variance behavior of the advanced SAG en- 
gine, this formal methodology fails to produce functionally 
correct designs [8]. Therefore, we use the following graph-theoretic
approach, which can be used for two-level pipelining
of systolic arrays irrespective of their behavior.
In our approach, the original systolic array is represented
at bit level by a finite, vertex-weighted, edge-weighted,
directed graph $G = (V, V', E, d_V, d'_{V'}, w_E)$ (from now on, for
simplicity, we say graph), where $V \cup V'$ and $E$ are the set of
vertices and the set of edges, respectively. The functions $d_V$,
$d'_{V'}$, and $w_E$ represent the weights defined on vertices in $V$,
vertices in $V'$, and edges in $E$, respectively. The graph $G$ is
constructed by replicating the graph of a PE,
$G_{PE} = (V_{PE}, V'_{PE}, E_{PE}, d_{V_{PE}}, d'_{V'_{PE}}, w_{E_{PE}})$,
and connecting vertices according to the communication between PE's,
as the entire systolic array is built from identical PE's.
In order to construct the graph $G_{PE}$, the bit-level
functional units are represented by vertices $V_{PE}$ and bit-level
storage by vertices $V'_{PE}$. The communication between vertices
is denoted by the edges, and for each edge $e \in E_{PE}$,
$w(e)$ denotes the earliest communication time slot for all
legal combinations of instructions. The edges which communicate
in the $i$th ($i = 0, 1, 2, \ldots$) time slot are weighted by $i$.
As any vertex $v \in V_{PE}$ represents a functional unit, the data
sent on an output edge of $v$ are dependent on the data
supplied on several input edges of $v$. Though the propagation
delay from each input edge to an output edge is different,
for simplicity, we assume that the propagation delay for
each vertex is equal to its worst propagation delay.
Hence, we weigh each vertex $v \in V_{PE}$ by $d(v)$, the maximum
numerical propagation delay of the functional unit represented
by the vertex $v$. In general, storage vertices represent
multiport registers. Therefore, the data coming from an
input edge are stored and passed to a selected output edge.
The output edge is selected by another input edge representing
a control signal. Therefore, we weigh each vertex $v' \in V'_{PE}$
by $d'(v', v_1, v_2)$ for all vertices $v_1$ and $v_2$ such that vertex $v_2$
receives data from vertex $v_1$ through vertex $v'$. The quantity
$d'(v', v_1, v_2)$ indicates the minimum latency (which is under
the control of instructions) through the storage vertex. If the
data supplied to vertex $v_2$ from vertex $v'$ have no dependency
on the data supplied to vertex $v'$ from vertex $v_1$, then
$d'(v', v_1, v_2)$ is undefined.
The maximum clock speed of the circuit is determined by 
the propagation delays of the critical path(s). A critical path 
in our graph G can be identified as a directed path activated 
in the same time slot such that the sum of the vertex weights 
on that path is maximum over the entire graph. If a critical 
path goes through a storage vertex, the critical path is 
terminated at the storage vertex whenever the vertex weight 
corresponding to that path is nonzero, as the vertex weights 
of storage vertices denote the latency whereas the weights of 
other vertices denote the propagation delay. In order to 
improve the clock speed, the graph is retimed. In retiming, 
pipeline registers are added to all critical paths to meet the 
given speed requirements, and then additional pipeline regis- 
ters are added to other edges and/or latencies at storage 
vertices are changed such that the conditions of the following 
theorem are satisfied. 
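
The critical-path definition above (a directed path activated in one time
slot with maximum total vertex weight, cut at storage vertices) can be
sketched as a longest-path search. This is a simplified illustration under
stated assumptions: every storage vertex is taken to have nonzero latency and
therefore always terminates a path, the combinational part is assumed to be
acyclic, and the example vertices and delays are invented.

```python
from functools import lru_cache


def critical_path_delay(delay, edges, storage=frozenset()):
    """Largest sum of vertex delays over paths whose edges share a time slot.

    delay:   dict mapping each functional vertex to its propagation delay
    edges:   iterable of (source, target, time_slot) triples
    storage: vertices that terminate a path (assumed nonzero latency)
    """
    succ = {}
    for u, v, slot in edges:
        succ.setdefault((u, slot), []).append(v)

    @lru_cache(maxsize=None)
    def longest_from(v, slot):
        if v in storage:
            return 0
        tail = max((longest_from(n, slot) for n in succ.get((v, slot), [])),
                   default=0)
        return delay[v] + tail

    slots = {slot for _, _, slot in edges} or {0}
    return max(longest_from(v, s) for v in delay for s in slots)


# Hypothetical PE fragment: a multiplexer feeding a two-part adder.
delay = {"mux": 1, "add_lo": 4, "add_hi": 4, "ctl": 2}
original = [("mux", "add_lo", 0), ("add_lo", "add_hi", 0),
            ("add_hi", "regA", 0), ("ctl", "mux", 1)]
# Same fragment with a pipeline register "r" on the carry path.
retimed = [("mux", "add_lo", 0), ("add_lo", "r", 0),
           ("r", "add_hi", 1), ("add_hi", "regA", 1), ("ctl", "mux", 1)]

before = critical_path_delay(delay, original, storage={"regA"})
after = critical_path_delay(delay, retimed, storage={"regA", "r"})
```

Adding the register shortens the longest same-slot path, but it also shifts
the downstream activity by one time slot, which is why the remaining edges
must then be balanced as the theorem below requires.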
Two-Level Pipelining Theorem: If a two-level pipelined de-
sign is obtained by adding pipeline registers to some edges 
and/or changing the latencies of storage vertices of G,  the 
logical behavior of the system will be kept intact if the 
differences in latencies through any pair of paths between 
any two vertices are equal in the original and retimed graphs 
when the weights of the storage vertices corresponding to 
these paths are not undefined. 
Proof: Let $m$ and $n$ be two vertices and $p_1$ and $p_2$ be
two paths from $m$ to $n$. Assume path $p_1$ is activated on
$i_{p_1 1}, i_{p_1 2}, \ldots, i_{p_1 N_1}$ time slots
($i_{p_1 1} < i_{p_1 2} < \cdots < i_{p_1 N_1}$) and $p_2$ is
activated on $i_{p_2 1}, i_{p_2 2}, \ldots, i_{p_2 N_2}$ time slots
($i_{p_2 1} < i_{p_2 2} < \cdots < i_{p_2 N_2}$). We can make this
assumption iff these paths do not contain edges connected to a
storage vertex such that the vertex weight is undefined. If we
introduce a latency of $k$ cycles in path $p_1$, by introducing
pipeline stages and/or changing storage latencies, vertex $n$ gets
the data from vertex $m$ through path $p_1$ in time slot
$i_{p_1 N_1} + k$. In the original graph, vertex $n$ gets the data
from vertex $m$ through paths $p_1$ and $p_2$ in time slots
$i_{p_1 N_1}$ and $i_{p_2 N_2}$, respectively. In order to keep the
logical behavior of the graph intact, we must get data through path
$p_2$ in time slot $i_{p_2 N_2} + k$. Now, we can see that the
difference in latency through paths $p_1$ and $p_2$ is equal to
$i_{p_1 N_1} - i_{p_2 N_2}$ cycles in both original and retimed
graphs. This argument, when applied to all paths between any vertex
pair, proves the theorem. $\square$
As any circuit in a graph can be described in terms of a set
of linearly independent circuits, it is enough to apply the
two-level pipelining theorem to a set of linearly independent
circuits of G. This produces a set of equations that contains
more variables than the number of equations. Therefore,
feasible solutions can be found by linear integer programming,
for example, by minimizing the total register count.
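
The condition of the theorem can be checked mechanically on a small example.
The sketch below is a simplification and not the integer-programming
formulation mentioned above: it assumes an acyclic graph, ignores storage
vertices with undefined weights, and checks by brute-force path enumeration
that every pair of paths between the same two vertices receives the same
added latency. The function names and the diamond-shaped example are
hypothetical.

```python
def added_latencies(added, src, dst):
    """Set of total added registers over every directed path src -> dst.

    added maps an edge (u, v) to the number of pipeline registers
    inserted on it (0 if none).
    """
    succ = {}
    for (u, v), k in added.items():
        succ.setdefault(u, []).append((v, k))
    totals = set()

    def walk(v, acc):
        if v == dst:
            totals.add(acc)
            return
        for nxt, k in succ.get(v, []):
            walk(nxt, acc + k)

    walk(src, 0)
    return totals


def preserves_behavior(added):
    """True if all parallel paths between any vertex pair gain equal latency."""
    vertices = {v for edge in added for v in edge}
    return all(len(added_latencies(added, m, n)) <= 1
               for m in vertices for n in vertices if m != n)


# Diamond: a -> b -> d and a -> c -> d.
unbalanced = {("a", "b"): 1, ("b", "d"): 0, ("a", "c"): 0, ("c", "d"): 0}
balanced = {("a", "b"): 1, ("b", "d"): 0, ("a", "c"): 1, ("c", "d"): 0}
```

If both paths between two vertices gain the same number of cycles, the
difference in their latencies is unchanged, which is the condition stated in
the theorem; a register on only one branch of the diamond violates it.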
Fig. 4. A Group-X architecture where the pipelining depth is limited.
(Legend: ○ : pipeline registers.)
V. TWO-LEVEL PIPELINED DESIGNS OF THE 
ADVANCED SAG ENGINE 
Due to limited space, we cannot present the full details of 
the conversion of our SAG engine into two-level pipelined 
versions. The necessary steps to convert the advanced SAG 
engine into two-level pipelined designs are: 
1) construct the graph G,
2) identify the critical paths and add pipeline registers
   into them to meet the given speed requirements,
3) apply the two-level pipelining theorem and get a
   functionally correct and feasible design.
Due to the difference in bit requirements for processor
addresses (i.e., X, DX) and intensity data (i.e., I, DI, DDI),
we get two groups of two-level pipelined architectures for
our advanced SAG engine. We present these architectures in
the following subsections.
A. The Group-X Architecture 
If the speed requirements are such that only a few pipeline 
registers are necessary on the carry path, we can encounter a 
situation where no pipeline registers are required on the part 
of the adder in which the addresses are updated. We refer to 
this configuration as Group-X architecture. Fig. 4 shows an 
example where only two groups of pipeline registers have 
been placed on the 36-b data path. This divides the 36-b data 
path into three blocks of 12 b each. Let us assume that the
instruction Iin and the first 12-b block of the input data Din
are supplied to the processor at the nth cycle. The instruction
provides the proper control signals to the data path such
that the correct inputs are provided to the adder. As the
processor addresses are represented by 12-b numbers and
sent into the processor on the first 12-b block of the data
path, any decision related to a processor address can be
taken in the same time slot as the address is decremented.
We recall that the rightmost multiplexers select the proper
output data depending on the decisions related to the processor
addresses. Therefore, the output instruction Iout and
the first 12-b block of the output data Dout can be produced
at the nth cycle. In the case of the second 12-b block, the
carry input to the adder is delayed by one clock cycle due to
the first pipeline register on the carry path. Hence, the input
and output activities of this adder must occur at the (n + 1)th
cycle. The following conditions are required for the second
12-b block:
1) the control signals which provide the inputs to the
   adder must be delayed by one cycle with respect to the
   relevant control signals of the first 12-b block;
2) the control signals which use the output of the adder
   must be delayed by one cycle with respect to the
   relevant control signals of the first 12-b block;
3) the input data Din must be supplied at the (n + 1)th
   cycle.
The set of pipeline registers between the first and second
12-b block has been inserted to meet the first and second
requirements. As the second section of the adder is generating
its output at the (n + 1)th cycle, the second 12-b block of
the output data Dout is generated on the (n + 1)th cycle.
Similarly, the pipeline registers between the second and
third 12-b blocks of the data path are inserted. For proper
operation, the last 12-b block of the input data Din must be
supplied in the (n + 2)th cycle and the last 12-b block of the
output data Dout will be generated in the (n + 2)th cycle.
Now we can see that the skews of different blocks of the
input data Din and output data Dout are compatible with
each other such that two processors can be cascaded. In the
case of the architecture in Fig. 3, the speed improvement of
Group-X architectures is limited to a factor of 3.

Fig. 5. A Group-Y architecture where the pipelining depth is
unlimited.
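
The skewed operation of the Group-X data path described above (three 12-b
blocks, with block j working on an operand one cycle after block j - 1) can
be illustrated by simulating a carry-pipelined adder. This is a behavioral
sketch only: the function name `pipelined_add`, the representation of the
input stream as a list of operand pairs, and the explicit carry-register list
are assumptions for illustration, not the PE circuit.

```python
def pipelined_add(pairs, width=36, blocks=3):
    """Add a stream of operand pairs on an adder with registered carries.

    In cycle t, block j (least significant first) processes pair t - j,
    so the blocks of one operand are consumed with a skew of one cycle each.
    """
    assert width % blocks == 0, "sketch assumes equally wide blocks"
    w = width // blocks
    mask = (1 << w) - 1
    carry_regs = [0] * (blocks - 1)      # pipeline registers on the carry path
    results = [0] * len(pairs)
    for cycle in range(len(pairs) + blocks - 1):
        next_regs = list(carry_regs)
        for j in range(blocks):
            idx = cycle - j              # operand handled by block j this cycle
            if not 0 <= idx < len(pairs):
                continue
            a, b = pairs[idx]
            carry_in = carry_regs[j - 1] if j > 0 else 0
            s = ((a >> (j * w)) & mask) + ((b >> (j * w)) & mask) + carry_in
            results[idx] |= (s & mask) << (j * w)
            if j < blocks - 1:
                next_regs[j] = s >> w    # carry is registered for the next cycle
        carry_regs = next_regs           # clock edge
    return results


pairs = [(0xFFF, 1), (0xFFFFFF, 1), ((1 << 36) - 1, 1), (123456789, 987654321)]
sums = pipelined_add(pairs)
```

A new operand pair enters every cycle, each block only sees a 12-b carry
chain, and the results still equal ordinary 36-b addition, which is why the
skewed inputs and outputs of cascaded processors remain compatible.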
B. The Group-Y Architecture 
When the speed requirements are severe, more pipeline 
registers must be placed on the carry ripple path of the 
adder. As soon as pipeline registers are introduced into the 
section of the adder where the address is decremented, any 
decision related to a processor address cannot be taken in 
the same cycle as the least significant bit of the address is 
supplied. We refer to this configuration as Group-Y archi- 
tecture. Fig. 5 depicts an example, where pipeline registers 
are placed 4 b apart. When the address is represented by a 
12-b number, any decision related to the address has to be 
postponed by two cycles. Hence, the inputs to the rightmost 
multiplexers must be delayed by two cycles. The pipeline 
registers on the horizontal buses provide the necessary delays.
Similar to the Group-X architecture, if there is a
pipeline register between the kth bit and (k + 1)th bit of the
adder, pipeline registers are necessary on all the control
signals which provide/use the data to/from the (k + 1)th bit
of the adder. Furthermore, the input data Din must be
skewed. We notice that the latency between any input port
to the relevant output port has been increased by two cycles.
In general, if there are k pipeline registers on the section of
the adder where the addresses are updated, k pipeline
registers are necessary on the horizontal buses. As the carry-out
used to detect X = 0 is an input to the controller, some pipeline
registers are also necessary in the controller. The speed
improvement of this group is superior to the previous group, and
the speed of a 1-b adder is the limiting factor.

Fig. 6. Some circuits used in the prototype. (a) The adder. (b) The
multiport register used for registers A, B, and C.

Fig. 7. Microphotograph of the prototype SAG engine.

TABLE II
THE CHARACTERISTICS OF THE PROTOTYPE (IN FULL CMOS)
AND THE EXPECTED CHARACTERISTICS OF AN OPTIMIZED
DESIGN (IN DOMINO LOGIC)

Characteristic       Prototype          Optimized Design
Pixel rate           66 - 83 MHz        66 - 83 MHz
Transistor count     85K                190 - 230K
PE's per chip        9                  50 - 60
Technology           1.6-μm CMOS        1.6-μm CMOS
Chip size            10.2 × 11.4 mm²    10.2 × 11.4 mm²
Power supply         5 V                5 V
Power dissipation    < 5 W at 83 MHz    < 5 W at 83 MHz
Package              144 PGA            144 PGA

VI. A PROTOTYPE DESIGN
Due to the superior speed of Group-Y architectures, we
have decided to implement the architecture given in Fig. 5 in
silicon. The following steps were followed during the design:
1) hardware description of the SAG engine to verify its
   behavior with the specifications;
2) two-level pipelining with estimated propagation delays;
3) hierarchical decomposition and floor planning;
4) leaf cell design using symbolic layouts;
5) circuit extraction for functionality verification of leaf
   cells and better propagation delay estimation by SPICE;
6) refinement of the design by repeating steps 2-5;
7) design rule checking, electrical rule checking, and then
   refinement of the design;
8) circuit extraction to verify the functionality of the
   complete design by switch-level simulation and verification.
For the hardware description, the in-house developed hardware
description language MoDL [9] was used. For the
symbolic layout design, the symbolic layout design system
CAMELEON [10] was used. The procedural design language
GrapMG [11] was used to build the processor in a hierarchical
form using the custom designed leaf cells. The design
rule checking and electrical rule checking were done by the
tools in the DRACULA design system.
Most leaf cells were designed such that they can be abutted.
The adder is shown to be the critical cell which determines
not only the speed of the processor but also the area
of a PE to considerable extent. The adder shown in Fig. 6(a)
was used due to its compactness, as all the transistors of the
same type can be implemented in a single diffusion area. A
two-phase serial shift register was used for the input buffer
BUF. The gate capacitance of an inverter has been used as
the storage element for the registers as the maximum storage
time is a few microseconds. The registers have several I/O
ports and transmission gates were used to select the proper
ports. Fig. 6(b) depicts the circuit of the multiport register
used for the registers A, B, and C. The multiplexers are also
based on the transmission gates. In the refined design (step
6), the width of the first section of the data path was reduced
to 3 b due to instruction decoding delays. For regularity, we
decided to implement the data path as 3-, 5-, 4-, 3-, 5-, 4-, 3-,
5-, and 4-b-wide sections. Fig. 7 is a microphotograph of the
prototype which consists of nine PE's. It contains the equivalent
of 85K minimum feature size transistors in a 1.6-μm
CMOS technology and the design was done in a university
environment (see Table II for more details). The area overhead
of the pipeline registers is approximately 25%. Due to
relatively low processor count per chip, we have investigated 
an improved design. Using Domino logic and better circuits, 
the transistor count per PE can be reduced by approximately 
a factor of 2.3. When cells are abutted, some of them have to 
be stretched. Better layouts can minimize area overhead due 
to cell stretching. According to our estimate, the density can 
be improved by approximately a factor of 2.5 using better 
mask layouts. Therefore, given adequate resources, 50 - 60 
PE's could be integrated in a single chip using the same 
1.6-μm CMOS technology without any speed reductions.
Submicrometer technologies could provide even better re- 
sults. 
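
The figures quoted in this section can be cross-checked with a short
calculation. The sketch below is only a consistency check under assumptions:
it takes the 12-b address to occupy the first sections of the data path, and
it multiplies the two improvement factors to estimate the PE count, which is
an interpretation of the text rather than the authors' stated derivation. All
names are introduced here for illustration.

```python
from itertools import accumulate

SECTIONS = [3, 5, 4, 3, 5, 4, 3, 5, 4]   # data-path section widths in bits
ADDRESS_BITS = 12                         # width of X and DX


def register_boundaries(sections):
    """Bit positions at which pipeline registers cut the carry path."""
    return list(accumulate(sections))[:-1]


def address_decision_delay(sections, address_bits):
    """Registers strictly inside the address part of the adder.

    Each one postpones an address decision by one cycle.
    """
    return sum(1 for b in register_boundaries(sections) if b < address_bits)


total_width = sum(SECTIONS)
delay_cycles = address_decision_delay(SECTIONS, ADDRESS_BITS)

# Assumed combination of the two quoted factors (2.3 from Domino logic and
# better circuits, 2.5 from better mask layouts) applied to nine PE's.
estimated_pes = 9 * 2.3 * 2.5
```

Under these assumptions the sections add up to the 36-b data width, the
address decision is postponed by two cycles as in the 4-b spacing example of
Fig. 5, and the estimate of roughly 52 PE's falls inside the quoted range of
50 - 60.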
The testability of the design is an important aspect. The 
advanced SAG engine was designed such that each processor 
can be tested individually by bypassing the other processors. 
The signals of the test PE are observed via a serial shift 
register and are controlled via the ports Iin and Din. The
prototype chip has been tested and it is fully functional. 
VII. CONCLUSIONS 
By converting the computationally intensive Phong shading
method into second-order interpolation, it is now possible
to generate images faster than Gouraud shaded images
using the same amount of processing power. Therefore,
real-time Phong shading becomes a reality. Due to the
robustness of our approach, no visible errors are introduced.
The simplicity of our approach enables significant speed
improvements for Phong shading even with general-purpose
hardware. The speed improvements can be further enhanced
by ASIC’s and hence we designed an advanced SAG engine.
A silicon implementation of a prototype SAG engine supporting
an advanced instruction set for real-time Phong
shading is reported. Apart from Phong shading, it also supports
Gouraud shading and Constant shading. The speed of
the advanced SAG engine is improved by two-level pipelining.
The two-level pipelined design is derived using the
formal approach presented in Section IV which handles
time-variant and shift-variant systolic arrays. The advantage
of the two-level pipelining is its capability to provide complex
functionalities at high pixel rates, which is difficult to achieve
by other means using the same silicon area. As computer
graphics users have a great desire for high image quality,
high interaction speed, and high resolution, we think that
two-level pipelined SAG engines supporting realistic shading
techniques will be a breakthrough for real-time computer
graphics.
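To illustrate why second-order interpolation is cheap in hardware, the sketch below evaluates a quadratic intensity function along a scan line by forward differencing, which needs only two additions per pixel. It is a generic textbook technique shown under stated assumptions, not the instruction sequence of the SAG engine; the function name and the coefficients a, b, and c are hypothetical.

```python
from typing import List


def quadratic_scanline(a: float, b: float, c: float, n: int) -> List[float]:
    """Evaluate I(x) = a*x^2 + b*x + c for x = 0..n-1 by forward differences.

    Hypothetical helper: the coefficients stand in for whatever the
    second-order shading setup produces for one scan line.
    """
    value = c        # I(0)
    d1 = a + b       # first difference I(1) - I(0)
    d2 = 2 * a       # second difference, constant for a quadratic
    intensities = []
    for _ in range(n):
        intensities.append(value)
        value += d1  # advance the intensity by the current first difference
        d1 += d2     # advance the first difference by the second difference
    return intensities


# Compare against direct evaluation of the polynomial.
incremental = quadratic_scanline(2, -3, 5, 6)
direct = [2 * x * x - 3 * x + 5 for x in range(6)]
print(incremental == direct)
```

First-order (Gouraud) interpolation would keep only the `value += d1` step with a constant `d1`, so the second-order scheme costs one extra addition per pixel in this formulation.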
ACKNOWLEDGMENT 
Thanks to J. Huisken and other members of the VLSI
Design Group at Philips Research Laboratories, Eindhoven,
for their help and for allowing us to use their design system
and fabrication process. The members of the Interactive
Systems Group, CWI, Amsterdam, are also acknowledged
for the discussions during the specification development
phase of the advanced SAG engine. Thanks also to S. Gerez
and K. Slump at the University of Twente, Enschede, for the
critical remarks and proofreading.
REFERENCES 
[1] P. J. W. ten Hagen et al., “Display architecture for VLSI-based
graphics workstation,” in Advances in Graphics Hardware I.
Berlin: Springer-Verlag, 1987.
[2] N. Gharachorloo and C. Pottle, “SUPER BUFFER: A systolic
VLSI graphics engine for real time raster image generation,” in
Proc. 1985 Chapel Hill Conf. VLSI, 1985, pp. 285-305.
[3] N. Gharachorloo et al., “Subnanosecond pixel rendering with
million transistor chips,” in Proc. SIGGRAPH, Aug. 1988, pp.
41-49.
[4] T. Nishizawa et al., “A hidden surface processor for 3-dimension
graphics,” in ISSCC Dig. Tech. Papers, Feb. 1988, pp.
166-167.
[5] J. A. K. S. Jayasinghe et al., “A display controller for a
structured frame store system,” in Advances in Graphics Hardware
III. Berlin: Springer-Verlag, 1989.
[6] J. D. Foley and A. van Dam, Fundamentals of Interactive
Computer Graphics. Reading, MA: Addison Wesley, 1984.
[7] H. T. Kung and M. S. Lam, “Fault-tolerance and two-level
pipelining in VLSI systolic arrays,” in Proc. Conf. Advanced
Res. VLSI, Jan. 1984, pp. 74-83.
[8] J. A. K. S. Jayasinghe and O. E. Herrmann, “Two-level pipelining
of systolic array graphics engines,” in Advances in Graphics
Hardware IV. Berlin: Springer-Verlag, 1990.
[9] J. Smit et al., “The MoDL hardware design system,” in Proc.
8th Int. Conf. Computer Hardware Description Languages and
Their Applications, Apr. 1987, pp. 327-342.
[10] K. Croes et al., “CAMELEON, A process tolerant symbolic
layout system,” in Proc. European Solid-State Circuits Conf.,
Sept. 1987, pp. 193-196.
[11] H. Jansen et al., “GrapMG: Cost effective module generation,”
in Proc. European Solid-State Circuits Conf., Sept. 1989, pp.
86-71.
J. A. K. S. Jayasinghe (S’86-M’88) was born in
Colombo, Sri Lanka, in 1960. In 1984 he received
the B.Sc. (engineering) degree in electronics
and telecommunication engineering from
the University of Moratuwa, Sri Lanka. In the
same year, he joined the academic staff of the
same university as an Assistant Lecturer. In
1987 he received the M.E.E. degree from the
NUFFIC (Netherlands University Foundation
for International Cooperation). Currently he is
working towards the Ph.D. degree at the University
of Twente, Enschede, The Netherlands. His main research
interests are parallel processing systems for computer
graphics and medical electronics.
F. Moelaert El-Hadidy was born in Prague, 
Czechoslovakia, in 1959. In 1982 she received 
the B.Sc. degree in communication and elec- 
tronics engineering from the University of Cairo, 
Egypt. In December 1986 she received a 
Diploma from the Philips International Insti- 
tute, Eindhoven, The Netherlands. In 1989 she 
received the M.Sc. degree from the University 
of Cairo. She is currently working towards the 
Ph.D. degree at the University of Twente, En- 
schede, The Netherlands. 
She worked as a software engineer in Egypt from 1982 to September 
1985. She then worked in the Philips CAD center in Eindhoven on an 
analog circuit analysis package until August 1987. Her interests include 
parallel architectures, systolic arrays, graphics, and ASIC’s. 
G. Karagiannis was born in Braila, Rumania, in
1964. In 1987 he received the “Technologos
Mihanikos” degree in electronics from the
Technical Institute of Education (T.E.I.) of
Athens, Greece. In 1988 he received the Ing.
degree in electrical engineering from the Polytechnical
High School (H.T.S.) of Enschede,
The Netherlands.
In the same year he joined the staff of the
University of Twente, Enschede, The Netherlands.
Currently he is working in the field of IC
design developing a graphics workstation. He is a part-time
M.Sc. student in electronics at the same university.
Otto E. Herrmann (M’72) was born in Düsseldorf,
West Germany, in 1933. He received the
Dipl.-Ing. and Dr.-Ing. degrees in electrical en- 
gineering from the University of Technology 
Aachen, West Germany, in 1959 and 1965, re- 
spectively. In 1971 he received the “venia leg- 
endi” for telecommunication from the Univer- 
sity of Erlangen, West Germany. 
From 1959 to 1965 he was a Research Associ- 
ate and Lecturer at the Universities of Aachen 
and Karlsruhe. From 1966 to 1972 he was Senior
Staff Member and from 1972 to 1975 “Abteilungsvorsteher” in the
Department of Electrical Engineering of the University of Erlangen. 
During 1972 he was a Visiting Scientist on leave at Bell Laboratories, 
Murray Hill, NJ. Since November 1975, he has been heading the group 
for network theory, signal processing, and CACSD as a full Professor at 
the Faculty of Electrical Engineering at the University of Twente, 
Enschede, The Netherlands.
J. Smit was born in 1944 in Berkhout, The Netherlands. He received
the Ing. degree in electrical engineering in 1966 from the
Polytechnical High School (H.T.S.) in Enschede,
The Netherlands, and the B.Sc. degree in electrical
engineering in 1971 and the Ir. degree in
electrical engineering in 1975 from the University
of Twente, Enschede, The Netherlands.
Since 1971 he has been working at the University
of Twente at several levels of responsibility.
Currently he is a senior staff member
(“Universitair hoofd docent”) in the Laboratory
for Network Theory at the University of Twente,
where he is responsible for research and education
in the area of VLSI design.
