Method and device for maximizing memory system bandwidth by accessing data in a dynamically determined order by Wulf, William A. et al.
US006154826A 
United States Patent [19] [11] Patent Number: 6,154,826
Wulf et al. [45] Date of Patent: Nov. 28, 2000
[54] METHOD AND DEVICE FOR MAXIMIZING 
MEMORY SYSTEM BANDWIDTH BY 
ACCESSING DATA IN A DYNAMICALLY 
DETERMINED ORDER 
[75] Inventors: William A. Wulf, Charlottesville, Va.; 
Sally A. McKee, Portland, Oreg.; 
Robert Klenke, Charlottesville, Va.; 
Andrew J. Schwab, Raleigh, N.C.; 
Stephen A. Moyer, Atlanta, Ga.; James 
Aylor, Charlottesville, Va.; Charles 
Young Hitchcock, E. Thetford, Vt. 
[73] Assignee: University of Virginia Patent 
Foundation, Charlottesville, Va. 
[21] Appl. No.: 08/808,355 
[22] Filed: Feb. 28, 1997 
Related U.S. Application Data
[63] Continuation-in-part of application No. 08/340,740, Nov. 16, 1994, abandoned.
[51] Int. Cl.7 ........ G06F 9/26
[52] U.S. Cl. ........ 711/217; 711/204; 711/167; 711/169
[58] Field of Search ........ 345/418; 701/1; 395/500; 711/202, 100, 101, 109, 110, 163, 167, 132, 217, 204, 169; 710/52, 53
[56] References Cited
U.S. PATENT DOCUMENTS
4,325,120 4/1982 Colley et al. ........ 711/202
4,821,181 4/1989 Iwasawa et al. ........ 395/702
5,313,636 5/1994 Noble et al. ........ 707/1
5,606,207 2/1997 Tomassi et al. ........ 345/418
OTHER PUBLICATIONS 
Tanenbaum, Structured Computer Organization, Second 
Edition, 1984, pp. 10-12.
Primary Examiner: B. James Peikari
[57] ABSTRACT
A data processing system is disclosed which comprises a 
data processor and memory control device for controlling 
the access of information from the memory. The memory 
control device includes temporary storage and decision 
ability for determining what order to execute the memory 
accesses. The compiler detects the requirements of the data 
processor and selects the data to stream to the memory 
control device which determines a memory access order. 
The order in which to access said information is selected 
based on the location of information stored in the memory. 
The information is repeatedly accessed from memory and 
stored in the temporary storage until all streamed information
is accessed. The information is stored until required by
the data processor. The selection of the order in which to 
access information maximizes bandwidth and decreases the 
retrieval time. 
7 Claims, 65 Drawing Sheets 
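The effect described in the abstract, choosing the access order from where the data sits in memory, can be illustrated with a small cost model. The following is only a sketch under stated assumptions and is not the patented device: it assumes a single-bank, page-mode memory, and the page size, cycle costs, stream base addresses, and function names are all hypothetical values chosen for illustration. It compares the processor's natural interleaved order for two streams with an order that fetches a FIFO-sized run from one stream before switching to the other.

```python
PAGE_WORDS = 512   # assumed page size in words (hypothetical)
HIT_CYCLES = 1     # assumed cost of an access to the currently open page
MISS_CYCLES = 4    # assumed cost of an access that opens a new page


def access_cost(addresses):
    """Total cycles for a single-bank, page-mode memory serving addresses in order."""
    open_page = None
    cycles = 0
    for addr in addresses:
        page = addr // PAGE_WORDS
        cycles += HIT_CYCLES if page == open_page else MISS_CYCLES
        open_page = page
    return cycles


def natural_order(streams, length):
    """Processor order: one element from each stream per loop iteration."""
    return [base + i for i in range(length) for base in streams]


def batched_order(streams, length, fifo_depth):
    """Reordered: fetch up to fifo_depth consecutive words of one stream
    before switching to the next stream."""
    order = []
    for start in range(0, length, fifo_depth):
        stop = min(start + fifo_depth, length)
        for base in streams:
            order.extend(base + i for i in range(start, stop))
    return order


if __name__ == "__main__":
    streams = [0, 4096]   # base addresses of two vectors (hypothetical)
    n = 1024
    natural = access_cost(natural_order(streams, n))
    for depth in (8, 16, 32, 64, 128, 256):
        batched = access_cost(batched_order(streams, n, depth))
        print(f"FIFO depth {depth:3d}: natural {natural} cycles, reordered {batched} cycles")
```

In this toy model the natural order misses the open page on every access, while the batched order pays one page miss per run, so deeper buffers amortize the miss cost over more accesses. Both orders issue the same set of addresses; only the sequence differs.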
[Front-page drawing: block diagram. Legible labels: CENTRAL PROCESSING, SCALAR ACCESSES, CACHE, STREAM BUFFER, SCHEDULING UNIT, STATE, MEMORY (four blocks); reference numeral 14.]
https://ntrs.nasa.gov/search.jsp?R=20080004039 2019-08-30T02:10:05+00:00Z
U.S. Patent    Nov. 28, 2000    Sheet 1 of 65    6,154,826
[Sheet 1 of 65: drawing; no legible text recovered.]
[Sheet 2 of 65: drawing; no legible text recovered.]
[Sheet 3 of 65: drawing; no legible text recovered.]
[Sheet 4 of 65: drawing; no legible text recovered.]
[Sheet 5 of 65: drawing; no legible text recovered.]
[Sheet 6 of 65: FIGS. 6A, 6C, 6E, and 6F legible. Plots of % of peak bandwidth versus FIFO depth (8 to 256), with SMC and NONSMC curves for 1, 2, 4, and 8 banks.]
[Sheet 7 of 65: FIG. 7. Plot of % of peak bandwidth versus FIFO depth (8 to 256), with SMC and NONSMC curves for 1, 2, 4, and 8 banks.]
[Sheet 8 of 65: FIGS. 8C, 8E, and 8F legible. Plots of % of peak bandwidth versus FIFO depth (8 to 256), with SMC and NONSMC curves for 1, 2, 4, and 8 banks.]
[Sheet 9 of 65: only a plot legend (1, 2, 4, and 8 banks) is legible.]
[Sheet 10 of 65: FIGS. 10A, 10B, 10C, 10D, 10E, and 10F. Plots of % of peak bandwidth versus FIFO depth (8 to 256), with SMC and NONSMC curves for 1, 2, 4, and 8 banks.]
[Sheet 11 of 65: FIGS. 11A, 11B, 11C, 11D, 11E, and 11F. Plots of % of peak bandwidth versus FIFO depth (8 to 256), with SMC and NONSMC curves for 1, 2, 4, and 8 banks.]
[Sheet 12 of 65: FIGS. 12A, 12B, 12C, 12D, 12E, and 12F. Plots of % of peak bandwidth versus FIFO depth (8 to 256), with SMC and NONSMC curves for 1, 2, 4, and 8 banks.]
[Sheet 13 of 65: FIGS. 13A, 13B, 13E, and 13F legible. Plots of % of peak bandwidth versus FIFO depth (8 to 256), with SMC and NONSMC curves for 1, 2, 4, and 8 banks.]
[Sheet 14 of 65: FIGS. 14B, 14C, 14D, 14E, and 14F legible. Plots of % of peak bandwidth versus FIFO depth (8 to 256), with SMC and NONSMC curves for 1, 2, 4, and 8 banks.]
[Sheet 15 of 65: FIGS. 15A, 15B, 15E, and 15F legible. Plots of % of peak bandwidth versus FIFO depth (8 to 256), with SMC and NONSMC curves for 1, 2, 4, and 8 banks.]
[Sheet 16 of 65: FIG. 16A legible. Plots of % of peak bandwidth versus FIFO depth (8 to 256); no curve labels legible.]
[Sheet 17 of 65: FIGS. 17B and 17C legible. Plots of % of peak bandwidth versus FIFO depth (8 to 256) for 1, 2, 4, and 8 banks.]
[Sheet 18 of 65: plots with % of peak bandwidth axes; no figure labels legible.]
[Sheet 19 of 65: FIGS. 19A, 19B, 19C, 19D, 19E, and 19F. Plots of % of peak bandwidth versus FIFO depth (8 to 256), with SMC and NONSMC curves for 1, 2, 4, and 8 banks.]
[Sheet 20 of 65: FIGS. 20A and 20C legible. Plots of % of peak bandwidth versus FIFO depth (8 to 256) for 1, 2, 4, and 8 banks.]
[Sheet 21 of 65: FIGS. 21A, 21B, 21E, and 21F legible. Plots of % of peak bandwidth versus FIFO depth (8 to 256), with SMC and NONSMC curves for 1, 2, 4, and 8 banks.]
[Sheet 22 of 65: FIGS. 22A, 22B, 22C, 22E, and 22F legible. Plots of % of peak bandwidth versus FIFO depth (8 to 256), with SMC and NONSMC curves for 1, 2, 4, and 8 banks.]
[Sheet 23 of 65: FIG. 23A legible. Plots of % of peak bandwidth versus FIFO depth (8 to 256); no curve labels legible.]
[Sheet 24 of 65: FIGS. 24A, 24B, 24C, 24D, 24E, and 24F. Plots of % of peak bandwidth versus FIFO depth (8 to 256), with SMC and NONSMC curves for 1, 2, 4, and 8 banks.]
[Sheet 25 of 65: FIGS. 25B, 25C, 25D, 25E, and 25F legible. Plots of % of peak bandwidth versus FIFO depth (8 to 256), with SMC and NONSMC curves for 1, 2, 4, and 8 banks.]
[Sheet 26 of 65: plots with % of peak bandwidth axes; no figure labels legible.]
[Sheet 27 of 65: FIGS. 27A, 27B, 27C, 27E, and 27F legible. Plots of % of peak bandwidth versus FIFO depth (8 to 256) for 1, 2, 4, and 8 banks; curves labeled NONSMC are legible.]
[Sheet 28 of 65: FIGS. 28B, 28C, 28D, 28E, and 28F legible. Plots of % of peak bandwidth versus FIFO depth (8 to 256), with SMC and NONSMC curves for 1, 2, 4, and 8 banks.]
[Sheet 29 of 65: FIGS. 29A, 29C, and 29D legible. Plots of % of peak bandwidth versus FIFO depth (8 to 256); no curve labels legible.]
[Sheet 30 of 65: FIGS. 30B, 30C, 30D, 30E, and 30F legible. Plots of % of peak bandwidth versus FIFO depth (8 to 256), with SMC and NONSMC curves for 1, 2, 4, and 8 banks.]
[Sheet 31 of 65: FIGS. 31B, 31C, 31D, 31E, and 31F legible. Plots of % of peak bandwidth versus FIFO depth (8 to 256), with SMC and NONSMC curves for 1, 2, 4, and 8 banks.]
U.S. Patent Nov. 28, 2000 Sheet 32 of 65 6,154,826
[Plots, FIGS. 32A, 32C: % of peak bandwidth vs. FIFO depth (8, 16, 32, 64, 128, 256); curves for 1, 2, 4, and 8 banks.]
U.S. Patent Nov. 28, 2000 Sheet 33 of 65 6,154,826
[Plots, FIGS. 33A, 33B, 33C, 33E, 33F: % of peak bandwidth vs. FIFO depth (8, 16, 32, 64, 128, 256); curves for 1, 2, 4, and 8 banks; NONSMC curves labeled.]
U.S. Patent Nov. 28, 2000 Sheet 34 of 65 6,154,826
[Plots, FIGS. 34A, 34B, 34C, 34D, 34E, 34F: % of peak bandwidth vs. FIFO depth (8, 16, 32, 64, 128, 256); curves for 1, 2, 4, and 8 banks; SMC and NONSMC curves labeled.]
U.S. Patent Nov. 28, 2000 Sheet 35 of 65 6,154,826
[Plots, FIGS. 35D, 35E: % of peak bandwidth vs. FIFO depth (8, 16, 32, 64, 128, 256); curves for 1, 2, 4, and 8 banks.]
U.S. Patent Nov. 28, 2000 Sheet 36 of 65 6,154,826
[Plots, FIGS. 36A, 36B, 36E, 36F: % of peak bandwidth vs. FIFO depth (8, 16, 32, 64, 128, 256); curves for 1, 2, 4, and 8 banks; SMC and NONSMC curves labeled.]
U.S. Patent Nov. 28, 2000 Sheet 37 of 65 6,154,826
[Plots, FIGS. 37A, 37B, 37C, 37F: % of peak bandwidth vs. FIFO depth (8, 16, 32, 64, 128, 256); curves for 1, 2, 4, and 8 banks; SMC and NONSMC curves labeled.]
U.S. Patent Nov. 28, 2000 Sheet 38 of 65 6,154,826
[Plots, FIGS. 38A, 38B, 38C, 38D, 38E, 38F: % of peak bandwidth vs. FIFO depth (8, 16, 32, 64, 128, 256); curves for 1, 2, 4, and 8 banks; SMC and NONSMC curves labeled.]
U.S. Patent Nov. 28, 2000 Sheet 39 of 65 6,154,826
[Plots, FIGS. 39A, 39B, 39E, 39F: % of peak bandwidth vs. FIFO depth (8, 16, 32, 64, 128, 256); curves for 1, 2, 4, and 8 banks; SMC and NONSMC curves labeled.]
U.S. Patent Nov. 28, 2000 Sheet 40 of 65 6,154,826
[Plots, FIGS. 40A, 40B: % of peak bandwidth vs. FIFO depth (8, 16, 32, 64, 128, 256); curves for 1, 2, 4, and 8 banks.]
U.S. Patent Nov. 28, 2000 Sheet 41 of 65 6,154,826
[Plots, FIGS. 41A, 41B: % of peak bandwidth vs. FIFO depth (8, 16, 32, 64, 128, 256); curves for 1, 2, 4, and 8 banks; SMC and NONSMC curves labeled.]
U.S. Patent Nov. 28, 2000 Sheet 42 of 65 6,154,826
[Plots, FIGS. 42B, 42C, 42E: % of peak bandwidth vs. FIFO depth (8, 16, 32, 64, 128, 256); curves for 1, 2, 4, and 8 banks; SMC and NONSMC curves labeled.]
U.S. Patent Nov. 28, 2000 Sheet 43 of 65 6,154,826
[Plots, FIG. 43C: % of peak bandwidth vs. FIFO depth (8, 16, 32, 64, 128, 256); curves for 1, 2, 4, and 8 banks.]
U.S. Patent Nov. 28, 2000 Sheet 44 of 65 6,154,826
[Plots, FIGS. 44C, 44D, 44E, 44F: % of peak bandwidth vs. FIFO depth (8, 16, 32, 64, 128, 256); curves for 1, 2, 4, and 8 banks; SMC and NONSMC curves labeled.]
U.S. Patent Nov. 28, 2000 Sheet 45 of 65 6,154,826
[Plots, FIGS. 45A, 45B, 45C, 45E, 45F: % of peak bandwidth vs. FIFO depth (8, 16, 32, 64, 128, 256); curves for 1, 2, 4, and 8 banks; SMC and NONSMC curves labeled.]
U.S. Patent Nov. 28, 2000 Sheet 46 of 65 6,154,826
[Plots, FIGS. 46C, 46F: % of peak bandwidth vs. FIFO depth (8, 16, 32, 64, 128, 256); curves for 1, 2, 4, and 8 banks.]
U.S. Patent Nov. 28, 2000 Sheet 47 of 65 6,154,826
[Plots, FIGS. 47C, 47E, 47F: % of peak bandwidth vs. FIFO depth (8, 16, 32, 64, 128, 256); curves for 1, 2, 4, and 8 banks; SMC and NONSMC curves labeled.]
U.S. Patent Nov. 28, 2000 Sheet 48 of 65 6,154,826
[Plots, FIGS. 48A, 48B, 48C, 48F: % of peak bandwidth vs. FIFO depth (8, 16, 32, 64, 128, 256); curves for 1, 2, 4, and 8 banks; SMC and NONSMC curves labeled.]
U.S. Patent Nov. 28, 2000 Sheet 49 of 65 6,154,826
[Plots, FIG. 49A: % of peak bandwidth vs. FIFO depth (8, 16, 32, 64, 128, 256); curves for 1, 2, 4, and 8 banks.]
U.S. Patent Nov. 28, 2000 Sheet 50 of 65 6,154,826
[Plots, FIGS. 50A, 50B, 50C, 50E: % of peak bandwidth vs. FIFO depth (8, 16, 32, 64, 128, 256); curves for 1, 2, 4, and 8 banks; SMC and NONSMC curves labeled.]
U.S. Patent Nov. 28, 2000 Sheet 51 of 65 6,154,826
[Plots, FIGS. 51A, 51B, 51C, 51E, 51F: % of peak bandwidth vs. FIFO depth (8, 16, 32, 64, 128, 256); curves for 1, 2, 4, and 8 banks; SMC and NONSMC curves labeled.]
U.S. Patent Nov. 28, 2000 Sheet 52 of 65 6,154,826
[Plots of % of peak bandwidth; figure labels and legends were not recovered from this sheet.]
U.S. Patent Nov. 28, 2000 Sheet 53 of 65 6,154,826
[Plots, FIGS. 53A, 53B, 53C, 53D, 53E, 53F: % of peak bandwidth vs. FIFO depth (8, 16, 32, 64, 128, 256); curves for 1, 2, 4, and 8 banks; SMC and NONSMC curves labeled.]
U.S. Patent Nov. 28, 2000 Sheet 54 of 65 6,154,826
[Plots, FIGS. 54A, 54B, 54C, 54E, 54F: % of peak bandwidth vs. FIFO depth (8, 16, 32, 64, 128, 256); curves for 1, 2, 4, and 8 banks; SMC and NONSMC curves labeled.]
U.S. Patent Nov. 28, 2000 Sheet 55 of 65 6,154,826
[Plots, FIGS. 55A, 55C: % of peak bandwidth vs. FIFO depth (8, 16, 32, 64, 128, 256); curves for 1, 2, 4, and 8 banks.]
U.S. Patent Nov. 28, 2000 Sheet 56 of 65 6,154,826
[Plots, FIGS. 56A, 56B, 56C, 56D, 56E, 56F: % of peak bandwidth vs. FIFO depth (8, 16, 32, 64, 128, 256); curves for 1, 2, 4, and 8 banks; NONSMC curves labeled.]
U.S. Patent Nov. 28, 2000 Sheet 57 of 65 6,154,826
[Plots, FIGS. 57A, 57B, 57C, 57E, 57F: % of peak bandwidth vs. FIFO depth (8, 16, 32, 64, 128, 256); curves for 1, 2, 4, and 8 banks; SMC and NONSMC curves labeled.]
U.S. Patent Nov. 28, 2000 Sheet 58 of 65 6,154,826
[Plots, FIGS. 58C, 58F: % of peak bandwidth vs. FIFO depth (8, 16, 32, 64, 128, 256); curves for 1, 2, 4, and 8 banks.]
[Sheet 59 of 65. FIGS. 59A-59F: % of peak bandwidth versus FIFO depth (8, 16, 32, 64, 128, 256) for 1, 2, 4 and 8 banks, with nonSMC reference curves.]
[Sheet 60 of 65. FIGS. 60A-60F: % of peak bandwidth versus FIFO depth (8, 16, 32, 64, 128, 256) for 1, 2, 4 and 8 banks, with SMC and nonSMC curves.]
[Sheet 61 of 65. FIGS. 61A-61F: % of peak bandwidth versus FIFO depth (8, 16, 32, 64, 128, 256); legend entries for 1, 2, 4 and 8 banks.]
[Sheet 62 of 65. FIGS. 62A-62F: % of peak bandwidth versus FIFO depth (8, 16, 32, 64, 128, 256) for miss/hit cost ratios of 2, 3, 4, 8, 12 and 16.]
[Sheet 63 of 65. FIGS. 63A-63F: % of peak bandwidth versus FIFO depth (8, 16, 32, 64, 128, 256) for miss/hit cost ratios of 2, 3, 4, 8, 12 and 16.]
[Sheet 64 of 65. FIGS. 64A-64F: % of peak bandwidth versus FIFO depth (8, 16, 32, 64, 128, 256) for miss/hit cost ratios of 2, 3, 4, 8, 12 and 16.]
[Sheet 65 of 65. FIGS. 65A-65D: % of peak bandwidth (vertical axis marked from 100 to 800) versus FIFO depth (8, 16, 32, 64, 128, 256) for miss/hit cost ratios of 2, 3, 4, 8, 12 and 16.]
METHOD AND DEVICE FOR MAXIMIZING 
MEMORY SYSTEM BANDWIDTH BY 
ACCESSING DATA IN A DYNAMICALLY 
DETERMINED ORDER 
RELATE BACK
This invention is a continuation-in-part of U.S. application Ser. No. 08/340,740 filed Nov. 16, 1994 which is now abandoned.
This invention was made with government support under NASA Grant NAG-1242 and NSF Grant MIP-9307626. The government may have certain rights in the invention.
BACKGROUND OF THE INVENTION
1. Field of the Invention
The instant invention relates to hardware-assisted access ordering to increase memory system performance for commercially available high-performance processors.
2. Brief Description of the Prior Art
Processor speeds are increasing much faster than memory speeds. Microprocessor performance has increased by 50% to 100% per year in the last decade, while DRAM performance has risen only 10-15% per year. Memory bandwidth is, therefore, rapidly becoming the performance bottleneck in the application of high performance microprocessors to vector-like algorithms, including many of the “grand challenge” scientific problems. Currently, it may take as much as 50 times as long to access a memory element as to perform an arithmetic operation once accessed. Alleviating the growing disparity between processor and memory speeds is the subject of much current research.
Prior art has centered on a mechanism called a “cache” which automatically stores the most frequently used data in a higher speed, smaller, and much more costly memory. The success of cache technology hinges on a property called “locality”, which is the tendency for a program to repeatedly access data that is “close”. Assuming locality, a cache can reasonably predict future memory accesses based on recent past references.
Although the addition of cache memory is often a sufficient solution to the memory latency and bandwidth problems in general purpose scalar computing, the vectors used in scientific computations are normally too large to cache, and many are not reused soon enough to benefit from caching. Furthermore, vectors leave large footprints in the cache. For computations in which vectors are reused, iteration space tiling can partition the problems into cache-size blocks, but this can create cache conflicts for some block sizes and vector strides, and the technique is difficult to automate. Caching non-unit stride vectors leaves even larger footprints, and may actually reduce a computation’s effective memory bandwidth by fetching extraneous data. “ . . . while data caches have been demonstrated to be effective for general-purpose applications . . . , their effectiveness for numerical code has not been established”. Lam, Monica, et al, “The Cache Performance and Optimizations of Blocked Algorithms”, Fourth International Conference on Architectural Support for Programming Languages and Systems, April 1991.
Software techniques such as reordering and “vectorization” via library routines can improve bandwidth by reordering requests at compile time. Such techniques cannot exploit run-time information and are limited by processor register resources.
The traditional scalar processor concern has been to minimize memory latency in order to maximize processor performance. For scientific applications, however, the processor is not the bottleneck. Bridging this performance gap requires changing the approach to the problem and concentrating on minimizing average latency over a coherent set of accesses in order to maximize the bandwidth for scientific applications.
While many scientific computations are limited by memory bandwidth, they are by no means the only such computations. Any computation involving linear traversals of vector-like data, where each element is typically visited only once during lengthy portions of the computation, can suffer. Examples of this include string processing, image processing and other DSP applications, some database queries, some graphics applications, and DNA sequence matching.
The assumptions made by most memory architectures simply don’t match the physical characteristics of the devices used to build them. Memory components are usually assumed to require about the same amount of time to access any random location; indeed, it was this uniform access time that gave rise to the term RAM, or Random Access Memory. Many computer architecture textbooks specifically cultivate this view. Others skirt the issue.
Somewhat ironically, this assumption no longer applies to modern memory devices as most components manufactured in the last ten to fifteen years provide special capabilities that make it possible to perform some access sequences faster than others. For instance, nearly all current DRAMs implement a form of page-mode operation. These devices behave as if implemented with a single on-chip cache line, or page (this should not be confused with a virtual memory page). A memory access falling outside the address range of the current DRAM page forces a new page to be accessed. The overhead time required to set up the new page makes servicing such an access significantly slower than one that hits the current page.
Other common devices offer similar features, such as nibble-mode, static column mode, or a small amount of SRAM cache on chip. This sensitivity to the order of requests is exacerbated in emerging technologies. For instance, Rambus, Ramlink, and the new DRAM designs with high-speed sequential interfaces provide high bandwidth for large transfers, but offer little performance benefit for single-word accesses.
For multiple-module memory systems, the order of requests is important on yet another level: successive accesses to the same memory bank cannot be performed as quickly as accesses to different banks. To get the best performance out of such a system, advantage must be taken of the architecture’s available concurrency.
Most computers already have memory systems whose peak bandwidth is matched to the peak processor bus rate. But the nature of an algorithm, its data sizes, and placement all strongly affect memory performance. An example of this is in the optimization of numerical libraries for the iPSC/860. On some applications, even with painstakingly hand-crafted code, peak processor performance was limited to 20% by inadequate memory bandwidth.
A comprehensive, successful solution to the memory bandwidth problem must therefore exploit the richness of the full memory hierarchy, both its architecture and its component characteristics. One way to do this is via access ordering, which herein is defined as any technique for changing the order of memory requests to increase bandwidth. This is especially concerned with ordering a set of vector-like “stream” accesses.
There are a number of other hardware and software 
techniques that can help manage the imbalance between 
processor and memory speeds. These include altering the 
placement of data to exploit concurrency, reordering the 
computation to increase locality, as in “blocking”, address 
transformations for conflict-free access to interleaved 
memory, software prefetching data to the cache, and hard- 
ware prefetching vector data to cache. 
Memory performance is determined by the interaction of 
its architecture and the order of requests. Prior attempts to 
optimize bandwidth have focused on the placement of data 
as a way of affecting the order of requests. Some architec- 
tures include instructions to prefetch data from main 
memory into cache, referred to as software prefetching. 
Using these instructions to load data for a future iteration of 
a loop can improve processor performance by overlapping 
memory latency with computation, but prefetching does 
nothing to actually improve memory performance. 
Moreover, the nature of memories themselves has 
changed. Achieving greater bandwidth requires exploiting 
the characteristics of the entire memory hierarchy; it cannot 
be treated as though it were uniform access-time RAM. 
Moreover, exploiting the memory’s properties will have to
be done dynamically, because essential information (such as
alignment) will generally not be available at compile time.
The difference between the foregoing prior art techniques 
and the instant disclosure is the reordering of stream 
accesses to exploit the architectural and component features 
that make memory systems sensitive to the sequence of 
requests. 
Reordering can optimize accesses to exploit the underly- 
ing memory architecture. By combining compile-time detec- 
tion of streams with execution-time selection of the access 
order and issue, the instant disclosure achieves near-optimal 
bandwidth for vector-like accesses relatively inexpensively. 
This complements more traditional cache-based schemes, so 
that overall effective memory performance need not be a 
bottleneck. 
SUMMARY OF THE INVENTION 
The method of rapid data accessing uses a data processor 
for processing information with memory for information 
storage in conjunction with a memory control device which 
controls the access of stored information from the memory. 
The memory control device is provided with temporary 
storage and decision ability which allows the memory 
control device to select an access order, prefetch and store 
the information. The temporary memory temporarily holds 
the prefetched information until required by the data pro- 
cessor. The information is subsequently sent to the data 
processor in the order required for use. The compiler detects 
the ability to use the memory control device in response to 
the requirements of the data processor for information stored 
in the memory. The decision ability determines the order to 
execute the memory accesses based on the location of stored 
data within the processor’s memory. The information is 
repeatedly accessed from memory and stored in the tempo- 
rary storage until all information is accessed and stored. The 
information is sent to the data processor, when requested, in 
the order required for use. The use of the memory control 
device to select the order in which to access information 
maximizes bandwidth and decreases the retrieval time. The 
information requirements detected by the memory control 
can be data vectors. The memory can be multibank,
interleaved or page-mode DRAMs.
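The decoupling described in this summary can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions, not the disclosed hardware: the names `prefetch_streams` and `dot_product`, the dictionary standing in for memory, and FIFOs deep enough to hold a whole vector are all hypothetical simplifications. The controller side fetches each vector in one contiguous run, and the processor side consumes the buffered values in its own program order.

```python
from collections import deque

def prefetch_streams(memory, streams):
    """Fetch each stream in one contiguous run into its own FIFO.

    memory  : dict mapping address -> value (stand-in for DRAM)
    streams : dict mapping name -> (base, stride, count)
    Returns the FIFOs and the order in which memory was accessed.
    """
    fifos, issue_order = {}, []
    for name, (base, stride, count) in streams.items():
        fifo = deque()
        for i in range(count):
            addr = base + i * stride
            issue_order.append(addr)    # order seen by the memory
            fifo.append(memory[addr])   # value held until the CPU asks
        fifos[name] = fifo
    return fifos, issue_order

def dot_product(fifos, n):
    """Processor side: consume a(i), b(i) in the natural program order."""
    s = 0
    for _ in range(n):
        s += fifos["a"].popleft() * fifos["b"].popleft()
    return s

# Made-up memory contents and stream parameters for illustration only.
memory = {addr: addr % 7 for addr in range(16)}
streams = {"a": (0, 1, 4), "b": (8, 1, 4)}
fifos, issue_order = prefetch_streams(memory, streams)
result = dot_product(fifos, 4)
print(issue_order, result)
```

The memory sees all of `a` and then all of `b`, while the processor still reads alternating operands, which is the separation of request order from issue order that the summary describes.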
BRIEF DESCRIPTION OF THE DRAWINGS 
The advantages of the instant disclosure will become 
more apparent when read with the specification and the 
drawings, wherein: 
FIG. 1 is a plan view of the Stream Memory Controller;
FIG. 2 is a plan view of the architecture of the SMC board;
FIG. 3 is a plan view of SMC VLSI implementation;
FIG. 4 is a graph illustrating effective memory bandwidth versus depth of unrolling;
FIG. 5 is a chart illustrating the performance effect of FIG. 4;
FIGS. 6a, 6b, 6c, 6d, 6e and 6f are graphical representations of the P1-Long Vector Performance;
FIG. 7 is a graph representation of hydro Long Vector Performance When Bandwidth Scales With Interleaving;
FIGS. 8a, 8b, 8c, 8d, 8e and 8f are graphical representations of R1-Long Vector Performance;
FIG. 9 is a graph representation of M-Long Vector Performance;
FIGS. 10a, 10b, 10c, 10d, 10e and 10f are graphical representations of T1-Long Vector Performance;
FIGS. 11a, 11b, 11c, 11d, 11e and 11f are graphical representations of P1-Medium Vector Performance (for better nonSMC alignment);
FIGS. 12a, 12b, 12c, 12d, 12e and 12f are graphical representations of R1-Medium Vector Performance (for better nonSMC alignment);
FIGS. 13a, 13b, 13c, 13d, 13e and 13f are graphical representations of M-Medium Vector Performance;
FIGS. 14a, 14b, 14c, 14d, 14e and 14f are graphical representations of P1-Short Vector Performance;
FIGS. 15a, 15b, 15c, 15d, 15e and 15f are graphical representations of R1-Short Vector Performance;
FIGS. 16a, 16b, 16c, 16d, 16e and 16f are graphical representations of M-Short Vector Performance;
FIGS. 17a, 17b, 17c, 17d, 17e and 17f are graphical representations of T1-Medium Vector Performance (for better nonSMC alignment);
FIGS. 18a, 18b, 18c, 18d, 18e and 18f are graphical representations of T1-Short Vector Performance;
FIGS. 19a, 19b, 19c, 19d, 19e and 19f are graphical representations of T1-Long Vector Performance for a Different SMC Vector Alignment;
FIGS. 20a, 20b, 20c, 20d, 20e and 20f are graphical representations of T1-Medium Vector Performance for a Different SMC Vector Alignment;
FIGS. 21a, 21b, 21c, 21d, 21e and 21f are graphical representations of T1-Short Vector Performance for a Different Vector Alignment;
FIGS. 22a, 22b, 22c, 22d, 22e and 22f are graphical representations of P2-Long Vector Performance;
FIGS. 23a, 23b, 23c, 23d, 23e and 23f are graphical representations of P2-Medium Vector Performance (for better nonSMC alignment);
FIGS. 24a, 24b, 24c, 24d, 24e and 24f are graphical representations of P2-Short Vector Performance;
FIGS. 25a, 25b, 25c, 25d, 25e and 25f are graphical representations of R2-Long Vector Performance;
FIGS. 26a, 26b, 26c, 26d, 26e and 26f are graphical representations of R2-Medium Vector Performance (for better nonSMC alignment);
FIGS. 27a, 27b, 27c, 27d, 27e and 27f are graphical representations of R2-Short Vector Performance;
FIGS. 28a, 28b, 28c, 28d, 28e and 28f are graphical representations of T2-Long Vector Performance;
FIGS. 29a, 29b, 29c, 29d, 29e and 29f are graphical representations of T2-Medium Vector Performance (for better nonSMC alignment);
FIGS. 30a, 30b, 30c, 30d, 30e and 30f are graphical representations of T2-Short Vector Performance;
FIGS. 31a, 31b, 31c, 31d, 31e and 31f are graphical representations of P3-Long Vector Performance;
FIGS. 32a, 32b, 32c, 32d, 32e and 32f are graphical representations of P3-Medium Vector Performance (for better nonSMC alignment);
FIGS. 33a, 33b, 33c, 33d, 33e and 33f are graphical representations of P3-Short Vector Performance;
FIGS. 34a, 34b, 34c, 34d, 34e and 34f are graphical representations of R3-Long Vector Performance;
FIGS. 35a, 35b, 35c, 35d, 35e and 35f are graphical representations of R3-Medium Vector Performance (for better nonSMC alignment);
FIGS. 36a, 36b, 36c, 36d, 36e and 36f are graphical representations of R3-Short Vector Performance;
FIGS. 37a, 37b, 37c, 37d, 37e and 37f are graphical representations of T3-Long Vector Performance;
FIGS. 38a, 38b, 38c, 38d, 38e and 38f are graphical representations of T3-Medium Vector Performance (for better nonSMC alignment);
FIGS. 39a, 39b, 39c, 39d, 39e and 39f are graphical representations of T3-Short Vector Performance;
FIGS. 40a, 40b, 40c, 40d, 40e and 40f are graphical representations of P4-Long Vector Performance;
FIGS. 41a, 41b, 41c, 41d, 41e and 41f are graphical representations of P4-Medium Vector Performance (for better nonSMC alignment);
FIGS. 42a, 42b, 42c, 42d, 42e and 42f are graphical representations of P4-Short Vector Performance;
FIGS. 43a, 43b, 43c, 43d, 43e and 43f are graphical representations of R4-Long Vector Performance;
FIGS. 44a, 44b, 44c, 44d, 44e and 44f are graphical representations of R4-Medium Vector Performance (for better nonSMC alignment);
FIGS. 45a, 45b, 45c, 45d, 45e and 45f are graphical representations of R4-Short Vector Performance;
FIGS. 46a, 46b, 46c, 46d, 46e and 46f are graphical representations of T4-Long Vector Performance;
FIGS. 47a, 47b, 47c, 47d, 47e and 47f are graphical representations of T4-Medium Vector Performance (for better nonSMC alignment);
FIGS. 48a, 48b, 48c, 48d, 48e and 48f are graphical representations of T4-Short Vector Performance;
FIGS. 49a, 49b, 49c, 49d, 49e and 49f are graphical representations of P5-Long Vector Performance;
FIGS. 50a, 50b, 50c, 50d, 50e and 50f are graphical representations of P5-Medium Vector Performance (for better nonSMC alignment);
FIGS. 51a, 51b, 51c, 51d, 51e and 51f are graphical representations of P5-Short Vector Performance;
FIGS. 52a, 52b, 52c, 52d, 52e and 52f are graphical representations of R5-Long Vector Performance;
FIGS. 53a, 53b, 53c, 53d, 53e and 53f are graphical representations of R5-Medium Vector Performance (for better nonSMC alignment);
FIGS. 54a, 54b, 54c, 54d, 54e and 54f are graphical representations of R5-Short Vector Performance;
FIGS. 55a, 55b, 55c, 55d, 55e and 55f are graphical representations of T5-Long Vector Performance;
FIGS. 56a, 56b, 56c, 56d, 56e and 56f are graphical 
representations of T5-Medium Vector Performance (for 
better nonSMC alignment); 
FIGS. 57a, 57b, 57c, 57d, 57e and 57f are graphical 
representations of T5-Short Vector Performance; 
FIGS. 58a, 58b, 58c, 58d, 58e and 58f are graphical
representations of A1-Long Vector Performance;
FIGS. 59a, 59b, 59c, 59d, 59e and 59f are graphical
representations of A1-Medium Vector Performance;
FIGS. 60a, 60b, 60c, 60d, 60e and 60f are graphical
representations of A1-Short Vector Performance;
FIGS. 61a, 61b, 61c, 61d, 61e and 61f are graphical 
representations of Varying Miss/Hit Cost Ratios on a Single- 
Bank Memory System; 
FIGS. 62a, 62b, 62c, 62d, 62e and 62f are graphical 
representations of Varying Miss/Hit Cost Ratios on a Two- 
Bank Memory System; 
FIGS. 63a, 63b, 63c, 63d, 63e and 63f are graphical 
representations of Varying Miss/Hit Cost Ratios on a Four-
Bank Memory System; 
FIGS. 64a, 64b, 64c, 64d, 64e and 64f are graphical 
representations of Varying Miss/Hit Cost Ratios on an 
Eight-Bank Memory System; and, 
FIGS. 65a, 65b, 65c, 65d, 65e and 65f are graphical 
representations of hydro-Varying Miss Costs on Four 
Memory System. 
DETAILED DESCRIPTION OF THE 
INVENTION 
The instant invention discloses the use of hardware- 
assisted access ordering on a uniprocessor system. Using the 
instant disclosure with current memory parts and only a few 
hundred words of buffer storage, nearly the full peak band- 
width that the memory system can deliver can be consis- 
tently achieved. Moreover, this is done with naive code, and 
performance is independent of operand alignment. This 
technique combines compile-time detection of memory 
access patterns with a memory subsystem that decouples the 
order of requests generated by the processor from that issued 
to the memory system. This decoupling permits the requests 
to be issued in an order that optimizes use of the memory 
system. The approach involves detecting the pattern of 
future memory references in an algorithm at compile time 
then, using an analytic model of the memory, determining an 
optimal sequence of requests. The disclosed Smart Memory 
Controller (SMC) is used at execution time to issue “actual” 
memory requests in the order that maximizes bandwidth. 
As with any scalable performance architecture, the only 
possible solution is concurrency. At some level, independent 
memory subsystems must be provided whose aggregate 
bandwidth is sufficient even if that of the individual subsystems
is not. This is what parallel memory systems, both
partitioned and interleaved, have done for three decades.
Unfortunately, as with scalable computing systems, con- 
current memory systems do not uniformly deliver their peak 
bandwidth, as both systems are sensitive to the order of 
requests. This is illustrated by the dot product example: 
      do 10 i = 1, n
10    s = s + a(i) * b(i)
Scalar code for this example involves fetching an alternating
sequence of the a’s and b’s: <a(1), b(1), a(2),
b(2), . . . >. Whether or not this sequence will achieve the
maximum possible bandwidth from a given memory archi- 
tecture is problematic. In an interleaved memory system, if 
the arrays happen to begin on the same even/odd boundary,
the same module will be accessed twice in succession rather 
than alternating between them. This provides only a fraction 
of the possible bandwidth. 
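The effect can be made concrete with a small Python sketch. It assumes a two-bank, word-interleaved memory in which the bank is simply the word address modulo the number of banks; the base addresses and the helper names `canonical_order` and `same_bank_repeats` are hypothetical and chosen only for illustration.

```python
def canonical_order(base_a, base_b, n):
    """Word addresses requested by scalar code: a(1), b(1), a(2), b(2), ..."""
    order = []
    for i in range(n):
        order += [base_a + i, base_b + i]
    return order

def same_bank_repeats(addresses, banks=2):
    """Count consecutive requests that fall in the same bank, assuming
    word-interleaved banks (bank = address mod number of banks)."""
    return sum(1 for prev, cur in zip(addresses, addresses[1:])
               if prev % banks == cur % banks)

# Both arrays start on an even word address, so a(i) and b(i) share a bank.
requests = canonical_order(base_a=0, base_b=100, n=8)
repeats = same_bank_repeats(requests)
print(repeats, "of", len(requests) - 1, "transitions stay in the same bank")
```

In this toy model every a(i), b(i) pair hits the same bank back to back, so the second request of each pair has to wait for the bank to recover.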
Non-interleaved memory using page-mode DRAMS 
behaves similarly. These memory parts provide significantly 
faster access to data that is in the same row of the two- 
dimensional physical storage array. The sequence of alter- 
nating requests will flush the page-mode buffer on each 
request, thus negating the potential gains from this type of 
memory. 
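A comparable sketch for page-mode parts counts how often the open page changes. The model is deliberately crude and rests on assumptions not taken from the disclosure: a single bank with one open page, a page size of 512 words, and two vectors placed in different pages at made-up addresses.

```python
def page_misses(addresses, page_words):
    """Count accesses that miss the single open DRAM page."""
    open_page, misses = None, 0
    for addr in addresses:
        page = addr // page_words
        if page != open_page:
            misses += 1
            open_page = page
    return misses

PAGE_WORDS = 512                      # assumed page size in words
n = 64
a = [0 + i for i in range(n)]         # vector a, entirely in page 0
b = [4096 + i for i in range(n)]      # vector b, entirely in page 8

alternating = [addr for pair in zip(a, b) for addr in pair]
misses_alternating = page_misses(alternating, PAGE_WORDS)
misses_grouped = page_misses(a + b, PAGE_WORDS)
print(misses_alternating, misses_grouped)
```

Under these assumptions the alternating order misses the open page on every access, while grouping the accesses by vector misses only once per vector.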
In both of the above, the problem is due to the interaction 
of the memory architecture and the order of requests. 
Exploiting this interaction is the basis for the instant inven- 
tion. The program is processed in accordance with the 
following overall outline. 
The compiler reads the user program, translating the 
program to machine language. 
During compilation, the compiler detects the pattern of 
memory references which can utilize streaming; typically 
these will be a set of vector accesses, each of which can be 
characterized by a base address, stride, mode (read or write), 
and count. 
The compiler divides the code into streaming code, which
can take advantage of the SMC, and natural order code to be
processed by the CPU in a conventional manner.
The streaming code is arranged to precede the natural
order code in the order of execution.
At execution time the streaming code is conveyed to the
Memory Scheduling Unit (MSU), which then initiates
streamed data references.
The streamed data is prefetched and buffered according to
the instant disclosure, as set forth hereinafter.
Simultaneously, the CPU is processing the data in its
natural order.
Once all information has been accessed from standard 
memory and SMC, the information is “returned” as origi- 
nally requested. 
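The outline above characterizes each stream by a base address, stride, mode, and count. A minimal Python sketch of such a descriptor follows; the class name `StreamDescriptor`, its field names, and the example addresses are hypothetical and are not the format actually conveyed to the MSU.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StreamDescriptor:
    base: int     # address of the first element
    stride: int   # distance between consecutive elements, in bytes here
    count: int    # number of elements in the stream
    mode: str     # "read" or "write"

    def addresses(self):
        """Addresses touched by the stream, in element order."""
        return [self.base + i * self.stride for i in range(self.count)]

# Descriptors a compiler might emit for the dot product loop
# (all values are invented for illustration).
a = StreamDescriptor(base=0x1000, stride=8, count=4, mode="read")
b = StreamDescriptor(base=0x2000, stride=8, count=4, mode="read")
print([hex(x) for x in a.addresses()])
```

Because the descriptor fully determines the addresses, a controller holding it can generate requests ahead of the processor without further instructions.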
In order for the compiler to recognize and know to convey 
the required information to the MSU, a recurrence detection 
and optimization algorithm is utilized, for example as dis- 
closed by Davidson, Jack W., and Benitez, Manuel E., 
“Code Generation for Streaming: An Access/Execute
Mechanism”, Fourth International Conference on Architec- 
tural Support for Programming Languages and Operating 
Systems, April 1991, incorporated herein by reference. 
This can be illustrated using the above dot product 
example. In prior art interleaved memory systems, the 
processor will continue to issue its requests in the canonical 
order: <a(1), b(1), a(2), b(2), . . . >. In contrast, the SMC
handles the foregoing example in two ways. If the arrays 
start on different boundaries, it will simply pass through the 
canonical request order, thereby providing full bandwidth. If 
the arrays start on the same boundary, however, the SMC 
will alternate pairs of requests: <a(1), a(2), b(1), b(2),
a(3), . . . >, buffering one element from each array to allow 
the request to be supplied to the processor in the canonical 
order. 
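The paired order and the small amount of buffering it needs can be sketched in Python. The sketch assumes two word-interleaved banks with a(j) and b(j) in the same bank (the same-boundary case), an even element count, and hypothetical helper names; it illustrates the reordering idea and is not the MSU's actual logic.

```python
def smc_issue_order(n):
    """Issue order <a(1), a(2), b(1), b(2), a(3), ...> as (array, index) pairs."""
    order = []
    for i in range(1, n + 1, 2):
        block = [j for j in (i, i + 1) if j <= n]
        order += [("a", j) for j in block] + [("b", j) for j in block]
    return order

def bank_of(item, banks=2):
    """Assumed layout: a(j) and b(j) sit in the same bank, chosen by j."""
    _, j = item
    return j % banks

def deliver_canonical(issue_order, n):
    """Release buffered elements to the processor as a(1), b(1), a(2), ..."""
    wanted = [(name, j) for j in range(1, n + 1) for name in ("a", "b")]
    held, delivered, max_waiting, pos = set(), [], 0, 0
    for item in issue_order:
        held.add(item)
        while pos < len(wanted) and wanted[pos] in held:
            held.remove(wanted[pos])
            delivered.append(wanted[pos])
            pos += 1
        max_waiting = max(max_waiting, len(held))  # elements still parked
    return delivered, max_waiting

n = 8
issued = smc_issue_order(n)
repeats = sum(1 for p, c in zip(issued, issued[1:]) if bank_of(p) == bank_of(c))
delivered, max_waiting = deliver_canonical(issued, n)
print(repeats, max_waiting)
```

In this model the reordered requests never hit the same bank twice in a row, and the processor still receives its operands in the canonical order with at most one element parked at any time.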
If the memory system uses page-mode DRAMS, the SMC 
again utilizes two methods. In the likely case that the arrays 
are not in the same DRAM pages, the SMC’s optimal 
request sequence is alternating sequences of accesses to the 
same array, each sequence getting all of the data in a page. 
This method improves bandwidth by a factor of five on 
current memory chips. Additionally, modest amounts of 
buffering are adequate to achieve near-optimal performance. 
In the less likely case that the arrays overlap in the same 
page, alternating requests between the arrays is possible, but 
the boundary conditions are subtle. Unless the arrays have 
the same number of elements per page, the boundaries 
behave like the previous non-overlapped case. 
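How buffering depth might translate into bandwidth on page-mode parts can be estimated with a back-of-the-envelope Python sketch. This is not the analytic model of the disclosure and does not reproduce the plotted results; it assumes that each run of accesses to one stream costs one page miss followed by page hits, that a run never crosses a page boundary, and that the run length equals the FIFO depth. The miss/hit cost ratio of 5 is likewise an assumed value.

```python
def percent_of_peak(run_length, miss_hit_ratio):
    """Percent of peak bandwidth when each run of `run_length` accesses to one
    stream costs one page miss followed by page hits.

    Peak is defined as every access costing one page hit.
    """
    cost = miss_hit_ratio + (run_length - 1)   # in units of one page-hit time
    return 100.0 * run_length / cost

ratio = 5   # assumed miss/hit cost ratio
natural = percent_of_peak(1, ratio)            # alternating order: every access misses
table = {depth: round(percent_of_peak(depth, ratio), 1)
         for depth in (8, 16, 32, 64, 128, 256)}
print(natural, table)
```

Under these assumptions the estimate rises toward peak as the run length grows, which is consistent with the statement that modest amounts of buffering suffice; actual behavior depends on alignment, bank count, and startup costs that this sketch ignores.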
The key to reordering requests at compile time is knowledge
of the future. For typical applications this is difficult or
impossible to obtain; however, these are the applications for
which traditional cache schemes work well. By contrast,
scientific computations, where traditional caching is not as 
effective, are precisely those for which prediction of future 
references is possible. The instant invention works in con- 
junction with traditional caching to span a broader set of 
applications. 
In the two examples discussed above, if the physical
starting addresses were known, the transformation could be
performed at compile time. Generally, this is impossible for
a variety of reasons; for example, the code may be a library
function that cannot know its argument addresses at compile time.
Thus, at least some of the request string transformations
must be done at execution time. The typical role of the
compiler is to determine the pattern of references and a set
of possible transformations; the actual transformation must
be selected at execution time.
The data reference patterns in most scientific computa- 
tions can be described as an interleaved collection of 
accesses to vectors. Individual vector accesses can be 
described by a four tuple consisting of the name of the 
vector, the “stride” of the accesses (distance between vector 
elements), a count of the number of vector elements, and a 
“mode” (read or write). 
<name, stride, count, mode> 
If a particular tuple is denoted by $a_i$, then a general
“access pattern” can be defined as follows:
$a_i$ is an access pattern,
if A is an access pattern, then A:n is an access pattern and
denotes n repetitions of A,
if $A_1, \ldots, A_n$ are access patterns, then
$\{A_1, \ldots, A_n\}$
is an access pattern and denotes sequential execution of
the pattern $A_1$ followed by the execution of pattern $A_2$,
etc.
Thus if ‘a’ and ‘b’ are vectors, an expression such as 
{A:2, B:3}:100 
denotes the access pattern 
a, a, b, b, b, a, a, b, b, b, a, . . . 
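The access-pattern notation above can be sketched as a small recursive expander. This is only an illustration: the class names `Vec`, `Repeat`, and `Seq`, the function `expand`, and the per-vector cursor are hypothetical and do not appear in the specification. It also assumes that a bare tuple stands for one reference to the next element of its vector, which is the reading under which {A:2, B:3}:100 produces a, a, b, b, b, a, a, ...

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Optional, Tuple, Union


@dataclass(frozen=True)
class Vec:
    """One <name, stride, count, mode> tuple from the text."""
    name: str
    stride: int
    count: int
    mode: str  # "r" for read, "w" for write


@dataclass(frozen=True)
class Repeat:
    """A:n, i.e. n repetitions of the pattern A."""
    body: "Pattern"
    n: int


@dataclass(frozen=True)
class Seq:
    """{A_1, ..., A_n}, i.e. the patterns executed one after another."""
    parts: Tuple["Pattern", ...]


Pattern = Union[Vec, Repeat, Seq]


def expand(pattern: Pattern,
           cursors: Optional[Dict[str, int]] = None
           ) -> Iterator[Tuple[str, int, str]]:
    """Yield (vector name, element offset, mode) for each reference in order.

    Assumption: each time a bare tuple is reached, it references the next
    element of its vector, so the offset grows by the stride.
    """
    if cursors is None:
        cursors = {}
    if isinstance(pattern, Vec):
        index = cursors.get(pattern.name, 0)
        if index >= pattern.count:
            raise ValueError(f"vector {pattern.name!r} is exhausted")
        cursors[pattern.name] = index + 1
        yield (pattern.name, index * pattern.stride, pattern.mode)
    elif isinstance(pattern, Repeat):
        for _ in range(pattern.n):
            yield from expand(pattern.body, cursors)
    elif isinstance(pattern, Seq):
        for part in pattern.parts:
            yield from expand(part, cursors)
    else:
        raise TypeError(f"not an access pattern: {pattern!r}")


# {A:2, B:3}:100 with illustrative strides and counts.
a = Vec("a", stride=1, count=200, mode="r")
b = Vec("b", stride=2, count=300, mode="r")
pattern = Repeat(Seq((Repeat(a, 2), Repeat(b, 3))), 100)
references = list(expand(pattern))
print(" ".join(name for name, _, _ in references[:11]))
```

Because the expression is a tree with only two combining forms, the same structure could plausibly be encoded as a short descriptor sequence for hardware, which is the property the text relies on.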
These expressions can be used to describe both the access
pattern specified in the original scientific algorithm, and
hence requested by the CPU, and the optimal access
sequence for the memory. The simple form of these
expressions makes them easy to implement as an “instruction
sequence” for the SMC.
The SMC is generally applicable to any computing sys- 
tem having a processor which can perform non-caching 
loads and stores so that non-unit stride streams can be 
accessed without concomitantly accessing extraneous data 
and wasting bandwidth. For clarity within the specification, 
however, the description herein will be based on the archi- 
tecture illustrated in FIGS. 1-3. 
The instant invention was added to an Intel i860, which
was selected for its support of vector operations and
non-cacheable floating point load and store instructions, which
will be used to access stream operands. This has the
disadvantage that the stream buffers are external to the
processor, and therefore incur a higher access cost than the
internal cache. However, accesses to the stream buffers are fast
enough that using the instant invention results in a
significant performance increase. Utilization of the instant
disclosure on a computer having on-chip stream buffers would
further decrease the access time. The Intel i860 is used
herein as an example and in no way limits the scope of the
invention.
The SMC 10 is comprised of the Memory Scheduling
Unit, MSU 12, and the Stream Buffer Unit, SBU 14. As
illustrated in FIG. 1, the computer’s memory 32 is interfaced
to the central processing unit, CPU 18, through the MSU 12.
The MSU 12 includes logic to issue memory requests as well
as logic to determine the order of requests during streaming
computations. For non-stream accesses, the MSU 12
provides the same functionality and performance as a traditional
memory controller. This is crucial, as the access-ordering
circuitry of the MSU 12 is not on the critical path to memory
and in no way affects scalar processing.
The MSU 12 has full knowledge of all streams currently
needed by the processor and, given the base address, vector
stride, and vector length, it can generate the addresses of all
elements in a stream. The MSU 12 also knows the details of
the memory architecture, including interleaving and device
characteristics. The access-ordering circuitry uses this
information to issue requests for individual stream elements in an
order that attempts to optimize memory system
performance.
The separate SBU 14 provides registers that the CPU 18
uses to specify stream parameters (base address, stride,
length, and data size) and high-speed buffers for stream
operands. As with the stream-specific parts of the MSU 12,
the SBU 14 is not on the critical path to memory, and the
speed of non-vector accesses is not adversely affected by its
presence.
There are a number of options for the internal architecture
of the SBU 14 and MSU 12 and the examples of
organization disclosed herein should, in no way, limit the scope of the
invention. To discuss each and every architectural option
would require an extensive number of pages and would be
obvious to one skilled in the art based on the instant
disclosure.
An example of the overall architecture of the SMC 10 in
relation to the host board is shown in FIG. 2. The host
processor board 30 contains a 40 MHz microprocessor CPU
18, and a 2-way interleaved, cache-optimized 16 MB
memory system 22.
The SMC board 10, which contains the MSU 12 and SBU
14 within the VLSI chip 15, is connected to the processor
board 30 via an expansion connector 36. The SMC board 10
consists of the SBU 14, the SMC control logic, several data
path elements, and two interleaved banks of DRAM main
memory 26 and 28.
In processors having a maximum latency of 11 ns for the
address and cycle definition lines, accesses to the SMC
board 10 are pipelined because of this high latency. Further delay
is encountered as the signals travel to the expansion
connector from the CPU 18, making the signals available at the
edge of the SMC board only 10-12 ns before the (40 MHz)
clock edge. A pipeline stage is used to latch these signals,
thereby increasing the available time to access the SMC
board 10 within the next cycle. The onboard cache-optimized
memory system is similarly pipelined. The pipeline
stage 24 is required on the i860, as well as computers
with similar architectures; however, whether or not
pipelining is required will be evident to those versed in the art.
The high-speed memory of the SBU 14 is implemented
logically as a set of FIFOs 16. The order in which the buffer
is filled is determined by the MSU 12. In the case of stream
read accesses, the FIFOs 16 are filled from the DRAM and
drained by the CPU 18. For stream writes, the FIFOs are
filled by the CPU 18 and drained to the DRAM. From the
memory system’s point of view, each stream FIFO will be
implemented as a set of smaller FIFOs, or subFIFOs, one
per memory bank. The control logic must therefore fill (or
drain) the stream elements from a particular memory bank in
stream order. This is not a significant restriction, however, as
there will very rarely be any performance benefit from
servicing these elements out of order. On the other hand, the
subFIFO organization significantly reduces SMC 10
complexity, simplifying both the FIFO 16 status logic and
the logic to determine the next stream request to the DRAM
26 and 28.
In order to provide the flexibility necessary to explore
performance ramifications of different FIFO 16
configurations, a virtual FIFO scheme is used having an
internal dual-ported SRAM (DP-SRAM 50) for storage. The
depth and number of FIFOs 16 is thus limited only by the
size of the implemented DP-SRAM 50. To provide 100%
bus bandwidth between the CPU 18 and FIFOs 16 for
pipelined, double-precision floating point loads and stores,
the SMC 10 must be able to provide a double word every 25
ns, as the CPU 18 can supply a new quadword address every
50 ns. Since the DP-SRAM 50 (implemented in 1.2 μm
CMOS technology) used with the SMC 10 has an access
time on the order of 12 ns, two banks of interleaved
DP-SRAM are used to meet the bandwidth requirement. In
order to service continuous, double-precision floating point
accesses with no wait states, the SMC 10 must also be able
to accept a new address every 50 ns. This address is
presented to both banks of DP-SRAM 50, and two double
words are accessed. For reads, the first double word is sent
directly to the processor and the second is latched, at
pipeline 24, within the SMC 10 so that it can be sent on the
next bus cycle. For writes, the first double word from the
processor is latched within the SMC 10, again at pipeline 24,
until the next cycle, when the second double word arrives.
Both double words are then written into the DP-SRAM 50
together.
The SMC VLSI 15 implementation, shown in FIG. 3,
consists of several FIFO Control State Machines 44 and
Control/Status (CSC) registers 46, which are parts of the SBU
10. In addition to storing the stream parameters (base,
length, and stride), the CSC registers 46 govern the read/write
modes of the individual FIFOs 16 and provide a
user-accessible reset control for the entire SMC 10.
The Processor Bus Interface (PBI) state machine 48 is
responsible for handling all handshaking between the SMC
10 and the CPU 18, interfacing all requests from the SMC
board 10 memory, including stream, scalar, and cache line
accesses.
The FIFO state machine 44 maintains pointers to the
virtual FIFOs 16 contained in the DP-SRAM 50, as well as
status signals on the condition of each (full, empty, half full,
etc.). The FIFO state machine 44 allows simultaneous access
to the FIFO DP-SRAM 50, so that the SMC 10 bank
controllers 52 and 54 and the processor 18 can access the
FIFOs 16 concurrently. This capability is necessary for the
SMC bank controllers 52 and 54 to keep pace with the CPU
18’s stream requests.
The SMC 10 has low-skew clock distribution trees built
into its architecture, but the fixed delay in the clock as it is
driven onto the SMC 10 might be as great as 6 to 8 ns, which
is unacceptable for the high-speed design of 40 MHz or
greater. The SMC 10 therefore uses a phase-locked loop
(PLL) to synchronize its on-chip clock with the system
clock. In this design, the reference signal for the PLL is
connected to the clock driven off of the SMC 10 from its
distributed clock tree, and the locked signal is fed back to the
input of the SMC clock network. Clock synchronization
within 1 ns is possible using this approach.
A set of memory-mapped registers provides a
processor-independent way of specifying stream parameters. Setting
these registers at execution time allows the CPU 18 to
initiate an asynchronous stream of memory access
operations for a set of stream operands. Data retrieval from the
streams (loads) and insertion into streams (stores) can be done
in any of several ways. For example, the SBU 14 could appear
to be a traditional cache, or alternatively, the model could
include a set of FIFOs 16. In this organization, each stream
is assigned to one FIFO 16, which is asynchronously filled
from, or drained to, memory by the access/issue logic. The
“head” of the FIFO 16 is another memory-mapped
register. Load instructions from, or store instructions to, a
particular stream will reference the FIFO 16 head via this
register, dequeueing or enqueueing data as is appropriate. It
should be noted that the use of DRAM on both the host and
the SMC boards is shown for illustration in this application.
In a preferred embodiment all of the memory would be in
one location and accessible from either the SMC or cache.
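As a rough software analogy of the FIFO organization just described, the sketch below models a read stream whose parameters are written to registers and whose operands are then consumed through a FIFO head. The class `StreamFifoModel` and its method names are hypothetical, memory is modeled as a Python list of words, and the asynchronous access/issue logic is approximated by a synchronous refill after each dequeue.

```python
from collections import deque


class StreamFifoModel:
    """Toy model of one read-stream FIFO fed from memory (illustrative only)."""

    def __init__(self, memory, depth):
        self.memory = memory      # list standing in for word-addressed DRAM
        self.depth = depth        # FIFO capacity in elements
        self.fifo = deque()
        self.base = 0
        self.stride = 1
        self.length = 0
        self.issued = 0           # number of stream elements already fetched

    def write_stream_registers(self, base, stride, length):
        """Model the CPU storing stream parameters into the registers."""
        self.base, self.stride, self.length = base, stride, length
        self.issued = 0
        self.fifo.clear()
        self._fill()              # prefetching starts once parameters are set

    def _fill(self):
        """Stand-in for the access/issue logic: fetch until full or done."""
        while len(self.fifo) < self.depth and self.issued < self.length:
            address = self.base + self.issued * self.stride
            self.fifo.append(self.memory[address])
            self.issued += 1

    def load_head(self):
        """Model a load from the FIFO head register: dequeue one element."""
        if not self.fifo:
            raise IndexError("stream exhausted")
        value = self.fifo.popleft()
        self._fill()
        return value


memory = [10 * k for k in range(64)]
smc = StreamFifoModel(memory, depth=4)
smc.write_stream_registers(base=3, stride=5, length=6)
values = [smc.load_head() for _ in range(6)]
print(values)
```

The point of the model is that the consumer only ever names the head register; the address arithmetic stays entirely on the buffer side.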
Traditional caches retain their importance for code and 
non-vector data in a system equipped with an SMC 10. 
Furthermore, if algorithms can be blocked and data aligned 
to eliminate significant conflicts, the cache and SMC can be 
used in a complementary fashion for vector access. Under 
these conditions multiple-visit vector data can be cached, 
with the SMC 10 used to reference single-visit vectors. To 
illustrate this, consider implementing the matrix-vector mul- 
tiply operation: 
y = (A + B)x
where A and B are n×m matrices and y and x are vectors. The
code for a straightforward implementation using matrices
stored in column-major order is:
When the computation is strip-mined to reuse elements of
y, the code changes to:
      do 30 IT = 1,n,IS
        load y(IT) through y(min(n,IT+IS-1)) into cache
        do 20 j = 1,m
          load x(j) into processor register
          do 10 i = IT,min(n,IT+IS-1)
            y(i) = y(i) + (A(i,j) + B(i,j)) * x(j)
   10     continue
   20   continue
   30 continue
Partition size depends on cache size and structure. Ele- 
ments of “y” are preloaded into cache memory at the 
appropriate loop level, and the SMC 10 is then used to 
access elements of “A” and “B”, since each element is 
accessed only once. The reference to “x” is a constant within 
the inner loop, and is therefore preloaded into a processor 
register. 
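The strip-mined loop nest can be checked against an unblocked version with a short sketch. The function names are hypothetical, the matrices are plain nested lists indexed from zero, and the cache and register loads are represented only by comments, since Python gives no control over them.

```python
def matvec_unblocked(A, B, x, n, m):
    """y = (A + B) x, sweeping each column over all n rows."""
    y = [0.0] * n
    for j in range(m):
        xj = x[j]                          # x(j) is constant in the inner loop
        for i in range(n):
            y[i] += (A[i][j] + B[i][j]) * xj
    return y


def matvec_strip_mined(A, B, x, n, m, strip):
    """Same computation, processing y in strips of `strip` elements."""
    y = [0.0] * n
    for it in range(0, n, strip):
        hi = min(n, it + strip)            # y[it:hi] would be held in cache
        for j in range(m):
            xj = x[j]                      # held in a processor register
            for i in range(it, hi):
                # A[i][j] and B[i][j] are each touched exactly once overall,
                # which is why they are the natural stream operands.
                y[i] += (A[i][j] + B[i][j]) * xj
    return y


n, m = 7, 5
A = [[float(i + j) for j in range(m)] for i in range(n)]
B = [[float(i * j) for j in range(m)] for i in range(n)]
x = [float(j + 1) for j in range(m)]
print(matvec_strip_mined(A, B, x, n, m, strip=3))
```

Both versions accumulate each y element over j in the same order, so the results agree exactly; only the order in which A, B, and y are visited changes.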
Although the SMC 10 provides near-optimal bandwidth
for a given memory architecture, algorithm, and data
placement, it cannot compensate for an unfortunate
placement of operands. An example is a vector stride that results
in all elements being placed in a single bank of a multi-bank
memory. The SMC 10 and data placement are complementary and the
SMC 10 will perform better given good operand placement.
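The unfortunate-stride case can be made concrete with a small sketch. It assumes a word-interleaved memory in which word address w lives in bank w mod N; that mapping, and the function name `banks_touched`, are illustrative assumptions rather than details taken from the specification.

```python
from math import gcd


def banks_touched(base, stride, count, num_banks):
    """Return the sorted set of banks a strided vector visits.

    Assumes word-interleaved banks: bank = word_address % num_banks.
    """
    return sorted({(base + i * stride) % num_banks for i in range(count)})


print(banks_touched(base=0, stride=1, count=16, num_banks=4))
print(banks_touched(base=0, stride=2, count=16, num_banks=4))
print(banks_touched(base=0, stride=4, count=16, num_banks=4))
```

Under this mapping a sufficiently long vector visits num_banks / gcd(stride, num_banks) distinct banks, so a stride that is a multiple of the bank count confines the whole vector to one bank and removes any concurrency the controller could exploit.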
To illustrate one aspect of the bandwidth problem, as
discussed with respect to tridiag hereinafter, and how it
might be addressed at compile time, the effect of executing
the fifth Livermore Loop (tridiagonal elimination) using
non-caching accesses to reference a single bank of
page-mode DRAMs is shown. This computation occurs
frequently in practice, especially in the solution of partial
differential equations by finite difference or finite element
methods. Since it contains a first-order linear recurrence, it
cannot be vectorized. Nonetheless, the compiler can employ
the recurrence detection and optimization algorithm
disclosed by Davidson, supra, to reorder the requests and utilize
the prefetching capabilities of the SMC 10. This algorithm
generates streaming code where each computed value $x_i$ is
retained in a register so that it will be available for use as
$x_{i-1}$ on the following iteration. For medium or long vectors,
elements from “x”, “y”, and “z” are likely to reside in
different pages, so that accessing each vector in turn incurs
the page miss overhead on each access. The natural
reference sequence for a straightforward translation of the
computation:
$\forall i:\ x_i \leftarrow z_i \times (y_i - x_{i-1})$
is shown as:
loop:               loop:
  load z[i]           load z[i]
  load y[i]           load z[i + 1]
  stor x[i]           load y[i]
  jump loop           load y[i + 1]
                      stor x[i]
                      stor x[i + 1]
                      jump loop
    (a)                 (b)
The memory references likely to generate page misses in 
the above tridiag code would be: load z[i], load y[i], stor x[i] 
of loop (a) and load z[i], load y[i] and stor x[i] of loop (b). 
In the loop (a), a page miss occurs for every reference. 
Unrolling the loop and grouping accesses to the same vector, 
loop (b), amortizes the page-miss cost over a number of 
accesses; in this case three misses occur for every six 
references. 
Reducing the page-miss count increases processor-memory
bandwidth significantly. For example, consider a
device for which the time required to service a page miss is
four times that for a page hit, a miss/hit cost ratio that is
representative of current technology. The natural-order loop
in (a) delivers only 25% of the attainable bandwidth,
whereas the unrolled, reordered loop (b) delivers 40%.
There are other factors, such as bus limitations, that could
affect effective memory bandwidth, but they are ignored
here for the sake of simplicity.
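The 25% and 40% figures follow from a simple cost model, sketched below. The model assumes three vectors in distinct DRAM pages of a single bank, one page miss for the first access of each grouped run and page hits for the rest, and a 4:1 miss/hit cost ratio; the function name and parameters are illustrative.

```python
def bandwidth_fraction(depth, vectors=3, miss_cost=4.0, hit_cost=1.0):
    """Fraction of all-hit bandwidth for a loop unrolled `depth` times
    with accesses to each vector grouped together."""
    references = vectors * depth
    # Each vector costs one page miss followed by (depth - 1) page hits.
    elapsed = vectors * (miss_cost + (depth - 1) * hit_cost)
    return references * hit_cost / elapsed


natural = bandwidth_fraction(1)            # every reference misses
for depth in (1, 2, 4, 8, 16):
    fraction = bandwidth_fraction(depth)
    gain = fraction / natural - 1.0        # improvement over natural order
    print(f"depth {depth:2d}: {fraction:6.1%} of peak, gain {gain:6.1%}")
```

With these assumptions the fraction reduces to depth / (depth + 3), so deeper unrolling keeps helping but with diminishing returns.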
FIG. 4 illustrates effective memory bandwidth versus
depth of unrolling, given access times of 160 ns for page
misses and 40 ns for page hits. For the bottom curve, the loop
body (a) is essentially replicated the appropriate number of
times, as is standard practice. For the middle curve, accesses
have been arranged as per loop (b). The top curve depicts the
bandwidth attainable if all accesses were to hit the current
DRAM page. Reordering the accesses realizes a
performance gain of almost 130% at an unrolling depth of four,
and over 190% at a depth of eight. A register file size which
allows an unrolling depth of sixteen would improve
performance by approximately 240%.
The performance effect of FIG. 4 is illustrated in FIG. 5.
As illustrated, performance using the instant invention on
very short vectors is about 2.5 times that of a system without
an SMC 10. Performance on moderate length vectors is
about triple that of the non-SMC system and for long vectors
and deep FIFOs, bandwidth reaches 98.5% of peak.
As the foregoing illustrates, the performance benefits of
doing such static access ordering can be quite dramatic.
However, without the kinds of address alignment
information usually only available at run time, the compiler can’t
generate the optimal access sequence. Additionally, the
extent to which a compiler can perform this optimization is
further constrained by such things as the size of the
processor register file; for instance, tridiag can be unrolled at most
eight times on the CPU 18. The SMC 10 provides the
compiler, altered as stated heretofore, with the addressing
assistance required to generate the optimal access sequence.
TAXONOMY
There are a number of options for when and how access
ordering can be done, therefore the following sets forth the
taxonomy relied upon herein.
Access ordering systems can be classified by three key
components:
stream detection (SD): the recognition of streams
accessed within a loop, along with their parameters
(base address, stride, etc.);
access ordering (AO): the determination of interleaving of
stream references that most efficiently utilize the
memory system; and
access issuing (AI): the determination of when the
load/store operations will be issued.
Each of these functions may be addressed at compile time,
CT, or by hardware at run time, RT. This taxonomy classifies
access ordering systems by a tuple (SD,AO,AI) indicating
the time at which each function is performed.
Some prior art systems detect streams at compile time,
while others derive access-ordering algorithms relative to a
precise analytic model of memory systems. The second
approach unrolls loops and orders memory operations to
exploit architectural and device features of the target
memory system. The (CT,CT,CT) approach, while
providing some increase in bandwidth, is limited by the size of the
processor register file and lack of vector alignment
information available at compile time.
The SMC 10 can further increase the bandwidth
utilization of the (CT,CT,CT) system by providing buffer space and
automating vector prefetching to produce a (CT,CT,RT)
system. The MSU 12 relieves register pressure and
decouples the sequence of accesses generated by the CPU 18
from the sequence observed by the memory components.
The compiler recognizes the sequence of vector references
to be issued and buffered, but the actual access issue is
executed by the MSU 12.
Both of these solutions, however, are static in the sense
that the order of references seen by the memory is
determined at compile time; static techniques are inherently
limited by the lack of alignment information. Dynamic
access ordering systems introduce logic to determine the
interleaving of a set of references.
In a dynamic (CT,RT,RT) system, stream descriptors are
developed at compile time and sent to the MSU 12 at run
time, where the order of memory references is determined
dynamically and independently. Determining access order
dynamically allows the MSU 12 to optimize behavior based
on run-time interactions.
Fully dynamic (RT,RT,RT) systems implement access
ordering without compiler support by augmenting the
previous controller with logic to induce stream parameters.
Whether or not such a scheme is superior to another system
depends on the relative quality of the compile-time and
run-time algorithms for stream detection and relative
hardware costs. Proposals for “vector prefetch units” have
recently appeared, but these do not order accesses to fully
exploit the underlying memory architecture.
Based on analysis and simulations, the best engineering
choice is to detect streams at compile time, but to defer
access ordering and issue to run time (CT,RT,RT).
Choosing this scheme over an (RT,RT,RT) system follows a
philosophy that has guided the design of RISC processors,
that is, to move work to compile time whenever possible.
This speeds processing and helps minimize hardware.
This organization is both simple and practical from an
implementation standpoint. The FIFO 16 organization
utilized herein is close to the “stream units” of the WM
architecture as disclosed in Wulf, W. A., “Evaluation of the
WM Architecture”, 19th Annual International Symposium
on Computer Architecture, May 1992, which is incorporated
herein as if cited in full. The FIFO 16 organization as
disclosed herein can be considered a special case of a
decoupled access-execute architecture, as described in
Goodman, J. R., et al, “PIPE: A VLSI Decoupled Architecture”,
Twelfth International Symposium on Computer Architecture,
June 1985, and Smith, J. E. et al, “The ZS-1 Central Processor”, the
Second International Conference on Architectural Support
for Programming Languages and Systems, October 1987,
which are incorporated herein as if cited in full. An
Apparatus for Reading to and Writing from Memory Streams of
Data While Concurrently Executing a Plurality of Data
Processing Operations is disclosed in U.S. Pat. No. 4,819,155
to Wulf et al and is incorporated herein as if cited in full.
The disclosed combination hardware/software scheme does
not require heroic compiler technology, as the compiler need
only detect the presence of streams, which can be
accomplished through the use of streaming algorithms.
TESTING PARAMETERS
Tests were conducted by simulating a wide range of SMC
10 configurations, wherein the following factors were
varied:
FIFO depth,
vector length and stride,
dynamic policy,
number of memory modules, and
DRAM speed.
The results involve the following restrictions:
All memories modeled here consist of interleaved banks
of page-mode DRAMs, where each page is 2K double-words.
The DRAM page-miss cycle time is four times that of a
DRAM page hit, unless otherwise noted.
Non-SMC results are for the “natural” reference sequence
for each benchmark, using non-caching loads and stores.
SMC initialization requires two writes to memory-mapped
registers for each stream; this overhead has no
significant effect on results, and has not been factored into
the following tests.
The onboard memory, which is optimized for cache line
access (loads and stores of four 64-bit double words),
provided a basis of comparison. Initially, vector algorithm
test cases were run out of the onboard memory to obtain
base-line timing information. These real-time results are
compared herein with those of the same algorithms run out
of the SMC-controlled memory. The cache-optimized
onboard memory provides functionality (such as parity,
error correction, and cache snooping capabilities) that the
SMC-controlled memory will not, which may have affected
the overall timing results. This consideration was factored
into the comparisons.
The processor is modeled as a generator of load and store
requests only, and arithmetic and control are assumed never
to be a computational bottleneck. This places the maximum
stress on the memory system by assuming a computation
rate that out-paces the memory’s ability to transfer data.
Scalar and instruction accesses are assumed to hit in the
cache for the same reason.
Utilized Computations
The benchmark kernels used herein are:
copy: $\forall i:\ y_i \leftarrow x_i$
daxpy: $\forall i:\ y_i \leftarrow a x_i + y_i$
hydro: $\forall i:\ x_i \leftarrow q + y_i \times (r \times zx_{i+10} + t \times zx_{i+11})$
scale: $\forall i:\ x_i \leftarrow a x_i$
swap: $\forall i:\ tmp \leftarrow y_i,\ y_i \leftarrow x_i,\ x_i \leftarrow tmp$
tridiag: $\forall i:\ x_i \leftarrow z_i \times (y_i - x_{i-1})$
vaxpy: $\forall i:\ y_i \leftarrow a_i x_i + y_i$
msort: merge sort
mul: sparse matrix multiply
Daxpy, copy, scale, and swap are from the BLAS (Basic
Linear Algebra Subroutines). These vector and matrix
computations occur frequently in scientific computations and
have been collected into libraries of highly optimized
routines for various host architectures. Hydro and tridiag are the
first and fifth Livermore Loops, a set of kernels culled from
important scientific computations. The former is a fragment
of a hydrodynamics computation, and the latter is a
tridiagonal elimination computation. Vaxpy is a vector axpy
computation that occurs in matrix-vector multiplication by
diagonals. This algorithm is useful for the diagonally sparse
matrices that arise frequently when solving parabolic or
elliptic partial differential equations by finite element or
finite difference methods. Mul is a sparse matrix multiply,
and msort is a merge sort.
Herein “axpy” refers to a computation involving some
entity, “a”, times a vector “x” plus a vector “y”. For daxpy,
“a” is a double-precision scalar, so the computation is
effectively a scalar times a vector, plus another vector. In the
case of vaxpy, “a” is a vector, making the computation a
vector times a second vector, plus a third vector.
These benchmarks were selected because they represent
access patterns found in real scientific codes, including the
inner-loops of blocked algorithms. These benchmarks
constitute a representative subset of all possible access patterns
for computations involving a small number of vectors
(computations requiring more vectors can usually be broken
down into several parts, each using only a small number of
vectors).
Although these computations do not reuse vector
elements, they are often found in the inner loops of
algorithms that do. Examples include the blocked algorithms of
the Level 3 BLAS libraries, as well as the matrix-multiply
by diagonals operation mentioned above (which uses
vaxpy). Whether or not the vectors are reused has no bearing
on SMC performance, although lack of temporal locality
greatly diminishes the effectiveness of caching. The ability
to obtain good memory performance, even for computations
that do not benefit from caching, is one of the main
attractions of the instant invention.
The results for mul and msort are not addressed here. The
many simulations indicate that the performance curves for
the other benchmarks are remarkably similar. This similarity
results from the SMC’s ability to reorder accesses,
regardless of the access pattern expected by the processor.
As the SMC 10 exploits the underlying memory
architecture to issue accesses in an order that optimizes memory
bandwidth, for any memory system composed of interleaved
banks of DRAM components, there are at least two facets to
this endeavor. One is taking advantage of the available
concurrency among the interleaved banks, the other taking
advantage of the device characteristics. At each “decision
point” (i.e. each available memory bus cycle), the SMC 10
must decide how best to achieve these goals.
The algorithm design space, in the example disclosed
herein, is divided into two subspaces. The first subspace
consists of algorithms that first choose a bank (bank-centric
schemes), and the second consists of algorithms that
first choose an access (access-centric schemes). The
following is based on a memory composed of interleaved banks of
page-mode DRAMs and a FIFO-based SBU
implementation, as depicted in FIG. 1.
In these schemes, each bank operates independently, thus
each may be on a different DRAM page at any given time.
This kind of memory architecture differs from traditional
prior art interleaving schemes, where each bank “listens” to
the page address for each access, but only one bank responds
to the request.
A bank-centric algorithm for choosing the next access
must:
select the memory bank(s) to which the next access(es)
will be issued, and
choose an appropriate access from the pool of ready
accesses for each memory bank (this is equivalent to
selecting a FIFO to service).
As used herein a ready access refers to an empty position
in a read FIFO 16 (that position is ready to be filled with the
appropriate data element) or a full position in a write FIFO
16 (the corresponding data element is ready to be written to
memory).
Bank Accessing
Once the FIFO 16 to service has been determined, the
selection mechanism chooses an appropriate bank from the
set of banks servicing that FIFO 16. The possible candidates
are those banks that are presently idle. Since there may be
fewer banks than potential accesses, a set of available banks
is determined, and then accesses are considered only to those
banks. Strategies for selecting banks vary in the number of
banks accessed at a time, and in how many banks are considered
in the search. At one end of the spectrum lies the exhaustive
search strategy, to keep looking until the appropriate number
of banks is found or no unexamined banks remain. At the
other end of the spectrum, only one bank is considered.
These schemes must also impose an ordering on the banks
to determine which will be considered first.
The three bank-selection schemes simulated herein are
Parallel Access Initiation (P), Round-Robin Selection (R)
and Token Round-Robin Selection (T).
In the first scheme, Parallel Access Initiation (P), it is
attempted to initiate accesses to all available (non-busy)
banks. This greedy algorithm attempts to take full advantage
of available concurrency, but is generally impractical to
implement, since it requires a separate bus to each bank.
Although it appears that this algorithm should perform at
least as well as any other, this isn’t always the case. In general,
the interaction between memory bank availability, access
initiation, and processor activity is quite complex and the
results are not obvious.
In the Round-Robin Selection (R) scheme, only one
access is initiated; however, each bank is considered in turn
until an available one is found or there are no more banks
left. In a balanced system, where the number of banks is
matched to the memory speed, Scheme R essentially
staggers the accesses, so that it performs similarly to Scheme P,
but with slightly greater latency. The advantage of this
algorithm is lower implementation cost, since the bandwidth
requirements between the SMC and memory are lower than
for Scheme P.
In the last scheme, Token Round-Robin Selection (T),
again only one access is initiated, and only the next bank in
sequence is considered. If that bank is busy, nothing is done
at the current time. This is the easiest and least expensive to
implement of the three algorithms. In spite of Scheme T’s
simplicity, its performance rivals and sometimes exceeds
that of Scheme P and Scheme R.
For the Scheme R and Scheme T approaches, the most
reasonable strategy is to start with the next bank in sequence
after the bank to which the last access was initiated. Starting
with a fixed bank each time would cause some banks to be
under-used and accesses to those banks would effectively
have lower priority.
FIFO Algorithms
The FIFO-selection algorithms vary in sophistication,
ranging from those that use all available information to
decide what to do next, to those that do the easiest and
quickest thing they can.
Some algorithms first look for an access that hits the
bank’s current DRAM page. Others simply choose the next
FIFO in round-robin order, regardless of whether the next
access from that FIFO hits the current page.
If an algorithm that looks for a page hit can’t find one,
there are several ways to choose the next access. One is to
look for a “best” candidate based on how full (empty) the
read (write) FIFOs are. Since it is known that the page-miss
overhead will be incurred, it is optimal to amortize that cost
over as many page-hits as possible, hence choosing a FIFO
for which there will be many accesses to the new DRAM
page. Other algorithms simply choose the next FIFO in
sequence when they can’t find a page-hit.
When trying to decide which FIFO is “best” to service
next, the algorithm may consider the total contents of the
FIFO (this is the global view), or it may restrict itself to just
the portion of the FIFO for which the current bank is
responsible, referred to as a subFIFO (this is the local view).
Some algorithms require that a FIFO (subFIFO) meet a

the page overhead now. In the event that these algorithms
find no valid candidates, they either choose the next FIFO in
sequence, or do nothing until the next decision-making time.
There are several possibilities for prioritizing the FIFOs.
They can be considered in random order; in a fixed
order, always considering a given FIFO first; with priority
given to reads (or writes); starting with the last FIFO the selected
bank serviced; or starting with the last FIFO any bank
serviced. The latter two options seem most fair and
reasonable from an implementation standpoint. The first of these
encourages different banks to be working on different
FIFOs, while the second encourages several banks to be
working on the same FIFO. It is not intuitively obvious
which of these will yield better performance.
Ten FIFO-selection algorithms spanning the design space
were chosen, and numerous simulations were conducted for
each combination of bank- and FIFO-selection
schemes. The following algorithms should be considered as
examples and in no way limit the scope of the invention.
1   look for page hit; if none, choose fullest write/emptiest
    read subFIFO; search round-robin, starting with last
    FIFO accessed by current bank
2   look for page hit; if none, choose fullest write/emptiest
    read subFIFO that’s at least 1/2 full/empty; if none,
    choose next access found; search round-robin, starting
    with last FIFO accessed by current bank
3   look for page hit; if none, choose fullest write/emptiest
    read subFIFO that’s at least 1/2 full/empty; if none, do
    nothing; search round-robin, starting with last FIFO
    accessed by current bank
4   look for page hit; if none, choose next access found;
    search round-robin, starting with last FIFO accessed by
    current bank
5   choose next access; search round-robin, starting with
    last FIFO accessed by current bank
6   look for page hit; if none, choose fullest write/emptiest
    read subFIFO; search round-robin, starting with last
    FIFO accessed by any bank
7   look for page hit; if none, choose fullest write/emptiest
    read FIFO; search round-robin, starting with last FIFO
    accessed by current bank
8   look for page hit; if none, choose fullest write/emptiest
    read FIFO; search round-robin, starting with last FIFO
    accessed by any bank
9   look for page hit; if none, choose next access found;
    search round-robin, starting with last FIFO accessed by
    any bank
10  choose next access; search round-robin, starting with
    last FIFO accessed by any bank
certain “threshold” in order to be considered for for 
instance, an algorithm might require that a read FIFO 
(subFIF0) be at least half empty before it can be considered 
among the best candidates for the next access. The rationale 
for this sort of restriction springs from the overhead 60 
involved in accessing a new DRAM page, whenever 
should be amortized on as many as possible, ~f 
there are sufficiently few ready accesses to a given page, it 
may be worthwhile to wait until the processor has generated 65 
more accesses to that page (by removing elements from the 
read FIFO or writing elements to the write FIFO) than to pay 
Each pair-wise combination of bank-selection and FIFO- 
selection algorithms (P1 through T10) describes a particular 
bank-centric Ordering scheme. 
Access Ordering Schemes 
algorithms, two naive access-centric ordering schemes were 
looks at each FIFO in round-robin order, issuing accesses for 
the Same 
not all elements of the stream have been accessed, and 
there is room in the FIFO for another read operand, or 
another write operand is present in the FIFO. 
In addition to the above bank and 
DRAM pages must be switched, the cost of that miss Over Scheme A1 is simp1e: the SMC 
stream 
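As an illustration only, the condition under which Scheme A1 keeps issuing accesses for the same stream can be sketched as follows. The `StreamFifo` record and both function names are hypothetical and are not taken from the specification; the sketch assumes a FIFO is described by its depth, its current occupancy, and the number of stream elements still to be accessed in memory.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StreamFifo:
    """Hypothetical model of one SMC FIFO and the stream it buffers."""
    is_read: bool    # True for a read stream, False for a write stream
    depth: int       # FIFO capacity in elements
    occupancy: int   # elements currently held in the FIFO
    remaining: int   # stream elements not yet accessed in memory


def a1_keeps_servicing(fifo: StreamFifo) -> bool:
    """True while A1 may issue another access for this stream."""
    if fifo.remaining == 0:
        return False                       # whole stream already accessed
    if fifo.is_read:
        return fifo.occupancy < fifo.depth  # room for another read operand
    return fifo.occupancy > 0               # a write operand is present


def a1_next_fifo(fifos: List[StreamFifo], current: int) -> Optional[int]:
    """Round-robin search starting at `current`; None if nothing is ready."""
    count = len(fifos)
    for step in range(count):
        index = (current + step) % count
        if a1_keeps_servicing(fifos[index]):
            return index
    return None
```

Because the search starts at `current`, the sketch stays with the current FIFO for as long as its condition holds and only then advances to the next one.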
Scheme A2 is similar, except it incorporates the notion of
a threshold into the decision whether to continue servicing
the same FIFO: accesses that incur page-misses will only be
issued to the current FIFO if it is empty enough (for a read
FIFO) or full enough (for a write FIFO), otherwise each
FIFO in sequence is evaluated according to the same criteria.
If none is found to meet the threshold, no access is initiated
at that time.
Each of these fifteen access ordering schemes was run on
a single-bank system and interleaved systems of two, four,
and eight banks. Simulation results for the remaining five
FIFO-selection algorithms are extremely similar, therefore a
brief summary of their comparative performance is provided.
Vector Length
These results are for the seven benchmark algorithms set
forth heretofore, run on long (10,000-element), medium
(100-element), and short (10-element) vectors. The hydro
and tridiag benchmarks share the same access pattern, thus
their results for these simulations are identical, and are
presented together in each figure.
Vectors of 10,000 elements have been chosen as the “long”
vectors, although much longer vectors (on the order of
millions of elements) certainly exist in practice. These
vectors are long enough that SMC startup transients become
insignificant, and therefore performance for million-element
vectors is not expected to be materially different. An
additional reason for choosing a length of 10,000 rather than
one million is the effect of context switches when an SMC is
used in a multiprogrammed environment. An example would be a
hypothetical RISC system running at 50 MHz, executing an
average of one instruction per 20 ns clock cycle. If such a
system incurred a context switch about one hundred times a
second, it could execute roughly 500,000 instructions
between context switches. Therefore the system would
reasonably be expected to perform on the order of 10,000
iterations of an inner loop (up to 50 instructions) between
context switches. Thus the choice of “long” vector length is
appropriate in that it is long enough that startup transients
have essentially no effect on performance, and short enough
that the vectors represent an amount of work that might
reasonably be accomplished between context switches.
Table Parameters
Unless otherwise indicated, a negative entry indicates that
the first (single bank) alignment yielded better performance.
Values of magnitude greater than 1% are rounded to the
nearest tenth. For entries of lesser magnitude, the Tables
contain only the sign of the difference. Blank entries indicate
that differences, if any, are less than one hundredth of one
percent.
TESTING
FIGS. 6, 7, 8, 9 and 10 show SMC performance for long
vectors as a function of FIFO depth and number of memory
banks compared to the analogous nonSMC systems. For
these simulations, all vectors are aligned to begin in the
same bank.
FIGS. 11, 12, and 13 depict SMC performance for
medium vectors compared to the analogous nonSMC
memory systems, but here the vectors used for the nonSMC
results have a better alignment: the ith vector begins in
bank (i mod n), where n is the total number of banks.
FIGS. 14, 15 and 16 illustrate SMC performance on very
short (10-element) vectors. NonSMC performance is as
depicted in the long or medium vector graphs, depending on
vector alignment. For clarity, the nonSMC lines have been
omitted from these graphs.
Group 1-Algorithms P1, R1, and T1
Algorithm P1
As previously stated, this ordering algorithm attempts to
initiate an access to each idle bank at every available bus
cycle. For each memory bank “b”, it examines the FIFOs in
round-robin order, beginning with the last FIFO for which an
access to “b” was initiated. If it finds an access that hits the
current DRAM page, it issues that access. If no accesses for
the bank hit the current DRAM page, then an access is issued
for the FIFO requiring the most service from “b”. The
performance of Algorithm P1 is illustrated in FIGS. 6, 7, 11
and 14.
FIG. 6 and FIG. 7 show SMC performance for vectors of
10,000 elements as a function of FIFO depth and number of
memory banks. Most of the results presented here will be as
in FIG. 6, where performance is given as a percentage of
normalized peak bandwidth. Results for memory systems
with a greater number of modules represent a percentage of
a larger bandwidth. The bottom curves in FIG. 6 depict the
bandwidth attained by the analogous nonSMC systems. On
the daxpy benchmark, for example, an SMC system with
two memory banks achieves 97.8% of peak bandwidth,
compared to 18.7% for a nonSMC system. In general, SMC
systems with deep FIFOs achieve in excess of 94% of peak
bandwidth for all benchmarks and memory configurations.
The only exception is tridiag, which attains 91% of peak on
the four-bank system, and 85% of peak with eight banks.
Even with FIFOs that are only sixteen double-words deep,
the SMC systems consistently deliver over 80% of the
attainable bandwidth. Again, the tridiag benchmark is the
exception, where SMC systems with sixteen-deep FIFOs
achieve over 73% of peak.
The performance differences between tridiag and the
other kernels stem from its access pattern: it uses three
vectors, but accesses each only once per iteration. Vaxpy
also involves three vectors, but it splits the “y” vector into
two streams, read and write. This reuse gives it a lower
percentage of page misses for the SMC to amortize.
Similarly, copy and scale are distinguished by the presence
in the latter of a vector that is both read and written.
Increasing the number of banks reduces relative
performance, an unanticipated and unobvious effect. This is
due in part to keeping both the peak memory system
bandwidth and the DRAM page-miss/hit delay ratio
constant. Thus, the eight-bank system has four times the DRAM
page-miss latency of the two-bank system. Although the
percentage of peak bandwidth delivered for the architectures
with greater interleaving is smaller, the total bandwidth is
much larger. If, alternatively, the page-miss cycle time of the
memory components is held constant and the page-hit cycle
time is decreased with a faster bus, the peak bandwidth of the
total system increases proportionally to the number of
banks.
FIG. 7 illustrates SMC performance on the hydro
benchmark when the page-miss cycle time of the memory
components is held constant. Performance is given as a
percentage of the peak bandwidth of a single-bank memory system
with corresponding horizontal lines indicating peak
bandwidth for each architecture. The benchmark achieves a
noticeably lower percentage of total bandwidth for the four-
bank and eight-bank architectures. Increasing the number of
banks decreases the total number of accesses to each bank,
thus page-miss costs are amortized over fewer accesses.
Performance of nonSMC systems is independent of vector
length. Since these systems employ no dynamic access
ordering, the number of requests issued and the resulting
percentage of total bandwidth obtained are constant for each
loop iteration. This is true of any system in which access
issue is determined at compile time, including those that use
prefetching.
FIG. 11 depicts the results of simulating selection
algorithm P1 on benchmarks using vectors of 100 elements.
These SMC results depict the net effect of two competing
performance factors. With deeper FIFOs, DRAM page
misses are amortized over a larger number of total accesses,
which can increase performance. At the same time, the
processor has to wait longer to complete its first loop
iteration while the SMC prefetches numerous operands to be
used in the following loop iterations. This can decrease
performance, as evidenced by the tail-off beyond depth-32
FIFOs. Optimum FIFO depth should be run-time selectable
in the SMC, since it is so closely related to stream length.
Lack of dynamic ordering renders the performance of
nonSMC systems particularly sensitive to vector placement.
In the graphs depicting long-vector SMC performance, the
vectors are aligned so that they all compete for the same
bank on each iteration. This has little effect on SMC
performance since it reorders requests, but it prevents the
nonSMC systems from taking advantage of the potential
concurrency. In order to illustrate the effects of alignment on
bandwidth, the nonSMC results presented for medium-
length vectors represent starting addresses with staggered
alignment: the ith vector in the pattern begins in bank (i mod
n), where n is the number of banks. In spite of the more
favorable alignment, nonSMC daxpy performance is limited
to 30.0% of total bandwidth for a two-bank memory; hydro,
swap, and vaxpy are limited to 18.8%, 40.0%, and 25.0%,
respectively. Since scale uses only one vector, its
performance is unchanged.
For a memory system with eight banks, eight-deep FIFOs
are inadequate. For a stride-one vector, each bank will be
responsible for servicing only one FIFO position, which
severely limits the SMC’s ability to amortize DRAM page-
miss costs. The SMC’s memory access pattern for each bank
in this case is almost the same as that generated by the
processor, hence performance tends to sink towards that of
a nonSMC system. Note that even when the SMC can’t take
advantage of page-mode accesses, it prefetches
reads and buffers writes, thus it still offers some performance
advantages. In general, the greater the concurrency inherent
in the memory system, the deeper the SMC’s FIFOs need to
be in order to amortize each bank’s page-miss overhead.
FIG. 14 illustrates SMC performance on very short (10-
element) vectors. Performance improvements are not as
dramatic as for longer vectors, for there are very few
accesses over which to amortize page-miss costs.
Nonetheless, short vector computations benefit significantly
from an SMC. As noted above, nonSMC performance is, as
depicted in FIG. 6 or FIG. 12, dependent on vector
alignment.
Algorithm R1
This greedy algorithm is identical to P1, except that only
one access may be issued during any one bus cycle. The
algorithm examines the banks in round-robin order,
beginning with the bank following the one to which the most
recent access was made. It attempts to initiate an access
(according to the scheme described for P1, above) for the
first idle bank it finds. FIGS. 8, 12 and 15 depict R1’s
performance.
All three bank-selection schemes perform identically for
all benchmarks on a single-bank memory system. For this
FIFO-selection scheme, R1’s performance is extremely
similar to that of Algorithm P1: for systems with two and
four banks, performance is identical. For SMC systems with
eight banks, performance of the two schemes differs only for
very shallow FIFOs, where the SMC is unable to take
advantage of page hits.
In fact, performance of all the R algorithms is remarkably
similar to that of the P algorithms. This stems from the
design of the SBU. In the parallel scheme, there is a separate
bus to each memory bank, allowing the SMC to initiate
several accesses at a time. The SBU in the SMC described
herein can only process one data value at a time, due to the
fact that the FIFOs must be dual-ported in order to allow
simultaneous access by both the CPU and the MSU.
Implementing an efficient FIFO to allow more than two
simultaneous accesses would be much more difficult, and would
consume substantially more chip real estate. Thus, read
accesses completing simultaneously are effectively
serialized, since all but one of them are delayed until the next
cycle. Likewise, the SMC can only write one value each bus
cycle. This has the effect of staggering the initiation of
accesses to the different banks, so that the parallel
algorithms end up behaving much like the greedy round-robin
approaches. In view of these limitations, a parallel access-
initiation scheme would afford substantial performance
benefit only if the SBU were able to process several data values
at once, or if it processed them serially, but with a cycle time
much faster than that of the memory buses.
Algorithm T1
Like R1, Algorithm T1 issues at most one access each bus
cycle. Instead of considering each idle bank in turn when
attempting to initiate an access, T1 only considers the next
bank in round-robin order from the last bank considered. If
that bank is busy, or if no ready access to it exists, then no
access is initiated at the current time. FIGS. 10, 17 and 18
depict T1’s performance.
Again, the performance curves are very similar to those
for P1 and R1, with results for all but the shallowest FIFOs
differing by less than 1% of attainable bandwidth. Results
for FIFOs that are only eight double-words deep vary by
more than 15% of attainable bandwidth, but only for the
eight-bank memory system, where the SMC cannot take
advantage of page hits. Algorithm T1 slightly outperforms
the other two for some benchmarks. For short vectors, as
depicted in FIG. 18, Algorithm T1 achieves a higher
percentage of peak bandwidth for the ... and vaxpy
benchmarks run on a memory system with two banks, although the
margin is only a few percent. For instance, on the scale
computation, Algorithm T1 achieves 36.4% of the peak
bandwidth on an eight-bank system, whereas Algorithm R1
reaches only 32.8%. The same benchmark on a two-bank
architecture yields 69.0% of peak for Algorithm T1, as
opposed to 64.5% for Algorithm R1.
The trends among the performances of the P, R, and T
bank-selection schemes are present for all groups of
algorithms simulated, but there is simply too much data to make
meaningful comparisons between all ordering algorithms.
Since Scheme T is the most reasonable from an
implementation standpoint, testing was focused on ordering
algorithms employing this strategy, and Algorithm T1 was used
as a basis of comparison for performance of the other
algorithms. All the SMC results presented thus far have been for
vectors aligned such that corresponding elements of the
vectors reside in the same memory bank. This placement
degrades the memory performance of nonSMC systems, for
it generates bank conflicts and can cause thrashing behavior
with respect to DRAM pages. Since the SMC reorders
accesses to take advantage of the memory system’s available
bandwidth, it is relatively insensitive to operand placement
and alignment. To illustrate this, FIGS. 19 through 21 depict
SMC performance for Algorithm T1 using the same vector
alignment as for the nonSMC results in FIG. 11, FIG. 12,
and FIG. 17. In this alignment, the ith vector in the pattern
begins in bank (i mod n), where “n” is the number of banks.
NonSMC results in FIG. 19 are as in FIG. 10, where vectors
are aligned to begin in the same bank. NonSMC results in
FIG. 20 use the alignment just described for this set of SMC
experiments, and thus are the same as in FIG. 17. Since swap
is unaffected by alignment, results for that benchmark are
identical to the corresponding T1 results in FIG. 10, 17, and
18.
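As an illustration of the two vector alignments compared in this section, the following sketch computes which bank each vector touches on a given loop iteration. It assumes stride-one vectors and word-interleaved banks, so that element k of a vector starting in bank s falls in bank (s + k) mod n; that mapping and all function names are assumptions made for the example, not details taken from the specification.

```python
from typing import List


def start_banks(n_vectors: int, n_banks: int, staggered: bool) -> List[int]:
    """Starting bank of each vector: bank (i mod n) if staggered, else bank 0."""
    return [(i % n_banks) if staggered else 0 for i in range(n_vectors)]


def element_bank(start_bank: int, index: int, n_banks: int) -> int:
    """Bank holding element `index` of a stride-one vector (assumed mapping)."""
    return (start_bank + index) % n_banks


def banks_touched(n_vectors: int, n_banks: int, iteration: int,
                  staggered: bool) -> List[int]:
    """Banks accessed by the i-th element of every vector in one iteration."""
    return [element_bank(s, iteration, n_banks)
            for s in start_banks(n_vectors, n_banks, staggered)]
```

With three vectors and four banks, `banks_touched(3, 4, 0, staggered=False)` gives one bank repeated three times (a conflict on every iteration), whereas the staggered alignment spreads the same three accesses over three different banks.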
The differences in performance are summarized in Table
1. Table entries are obtained by subtracting the performance
numbers from FIG. 10, FIG. 17, and FIG. 18 from the
corresponding results in FIG. 19, FIG. 20, and FIG. 21. The
largest differences occur for memory systems with many
banks, especially with shallow FIFOs, where the lack of
buffer space prevents the SMC from effectively amortizing
page-miss costs. Differences for shorter vectors, although
not included here, are smaller still.
TABLE 1
T1 Long Vector Performance Differences
for Two Vector Alignments
(staggered minus single-bank)
Differences in Attained Percentage of Peak Bandwidth
FIFO depth
benchmark memory banks 8 16 32 64 128 256
copy 1
2
4
8
daxpy 1 + - + - +
2 + - + - - -
4 -4.0 + - - + -
8 -12.0 -4.0 + - - -
hydro 1 +
2 -1.9 -1.5 -
4 + +4.4 -1.6 + -2.4 -
8 - +3.0 -3.7 -1.8 +5.2 -
swap 1 + - +
2 - + + + + +
4 - - - - +
8 - -1.2
vaxpy 1 + - + + + +
2 + + - + + +
4 - + + - + +
8 -4.0 + + - + +
Group 2-Algorithms P2, R2, and T2
Algorithms P2, R2 and T2 are similar to those described
for Group 1-Algorithms P1, R1 and T1, except that they
incorporate the notion of a threshold of required service. For
each memory bank “b” selected by the access-initiation
scheme (P, R, or T), the FIFO-selection algorithm examines
the FIFOs in round-robin order, beginning with the last FIFO
for which an access to “b” was initiated. If it finds an access
that hits the current DRAM page, it issues that access. If no
accesses for the bank hit the current DRAM page, then it
looks for an access from a FIFO containing at least n/2 ready
accesses, where “n” is the number of FIFO positions that
map to bank “b”. If a FIFO requiring the appropriate amount
of service is found, an access is initiated. If no such FIFO
exists, the algorithm defaults to using the next FIFO
(following the one for which the most recent access to bank
“b” was initiated), attempting to initiate an access for it.
The performance of the Group 2 algorithms is depicted in
FIG. 22 through FIG. 31. Performance is extremely similar
to that of the corresponding algorithm from Group 1,
generally differing by less than 1% of peak bandwidth. The only
exception is the hydro benchmark. For medium-length
vectors, FIFOs of depth sixty-four, and an eight-bank
memory, Group 1 beats Group 2 by almost 4% of peak, yet
for a two-bank system with FIFOs half that depth, the Group
2 algorithms represent a performance gain of over 2% of
peak. For longer vectors, the differences are magnified, and
the effect of the threshold is erratic. For the four- and
eight-bank memories, Group 2 performance varies from
0.3% of peak worse to 6.7% better (most FIFO depths gain
at least 4% of peak), and there is no clear trend in the
variations in performance. For hydro on very short vectors,
Group 1 beats Group 2 by 5.2% of peak for very shallow
FIFOs on a two-bank memory system. The fact that the
threshold has relatively little effect on the performance for
most benchmarks suggests that when a DRAM page change
is necessary, the FIFO requiring the most service either
meets the threshold or happens to be the default selection.
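The Group 2 FIFO-selection rule (page hit first, then a FIFO meeting the n/2 service threshold, then the default next FIFO) can be sketched as below. This is a simplified model under stated assumptions: the `SubFifo` record, the function name, and the choice of the candidate with the most ready accesses among those meeting the threshold are illustrative, and the default step is taken to mean the first FIFO after the last-serviced one that has any ready access.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SubFifo:
    """Hypothetical view of one FIFO restricted to a single bank."""
    ready: int                # ready accesses destined for this bank
    next_page: Optional[int]  # DRAM page of the next ready access, if any


def select_fifo_group2(subfifos: Sequence[SubFifo], current_page: int,
                       last: int, n_positions: int) -> Optional[int]:
    """Index of the FIFO to service for this bank, or None if none is ready."""
    count = len(subfifos)
    order = [(last + k) % count for k in range(count)]
    # Step 1: prefer an access that hits the bank's current DRAM page.
    for i in order:
        if subfifos[i].ready > 0 and subfifos[i].next_page == current_page:
            return i
    # Step 2: otherwise take the neediest FIFO with at least n/2 ready accesses.
    best = None
    for i in order:
        meets_threshold = 2 * subfifos[i].ready >= n_positions
        if subfifos[i].ready > 0 and meets_threshold:
            if best is None or subfifos[i].ready > subfifos[best].ready:
                best = i
    if best is not None:
        return best
    # Step 3: default to the next FIFO after `last` that has a ready access.
    for k in range(1, count + 1):
        i = (last + k) % count
        if subfifos[i].ready > 0:
            return i
    return None
```

Under the same assumptions, returning `None` at step 3 instead of falling back to the next FIFO would give the "do nothing" behavior of FIFO-selection algorithm 3.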
TABLE 2 
Performance of Scheme T2 with Respect to T1 
Differences in Attained Percentage of Peak Bandwidth 
medium vectors long vectors 
FIFO depth FIFO depth 
benchmark banks 8 16 32 64 128 8 16 32 64 128 256
copy 1
2 
4 
8 
2 + - -2.2 
4 +1.4 
8 
2 
4 
8 
daxPY 1 
hydro 1 
scale 
swap 
-3.6 
+ + +  
+ + 
+ - + + +2.2 +1.9 
+5.9 +1.6 +1.2 +4.1 +4.6 
+5.9 +6.6 +5.6 +6.6 
+ + + + +  
+ + + +  
+ - + + + +  
+ + + + +  
+ 
+ + + + + +  
+ + + + +  
TABLE 2-continued 
Performance of Scheme T2 with Respect to T1 
Differences in Attained Percentage of Peak Bandwidth 
medium vectors long vectors 
FIFO depth FIFO depth 
benchmark banks 8 16 32 64 128 8 16 32 64 128 256
vaxpy 1 
2 
4 
8 
- +  
+ + + + + +  
+ + + + +  
Group 3-Algorithms P3, R3, and T3
Group 3 algorithms are almost identical to Group 2
algorithms, except that when there are no more ready
accesses that hit the current page of the chosen bank and no
FIFO meets the required threshold for service, no access is
initiated. The intent is to amortize the cost of a DRAM page
miss over as many page hits as possible. If it is necessary to
switch pages but there are sufficiently few accesses that
would hit the new page, paying the page-miss cost is
delayed until there are more accesses to offset the overhead.
The performance of the Group 3 algorithms is depicted in
FIG. 32 through FIG. 39. The fact that these algorithms
occasionally choose to do nothing has little or no effect on
long vector performance. The medium vector performance
tends to be slightly lower than for the algorithms in Group
1 or Group 2, and short vector performance generally suffers
a bit more.
For long vectors, the differences in performance between
the Group 3 and Group 1 schemes are generally within 1% or
2% of peak bandwidth, plus or minus. Again, the hydro
benchmark represents the exception. Here the mean
performance gain for all FIFO depths and interleaving factors is
4.1% of peak, and the maximum is 10.6% for eight banks
and depth-64 FIFOs. Performance is more erratic for
medium vectors, ranging from a 5.8% gain in peak
bandwidth to an 11.9% drop (as compared with the corresponding
Group 1 algorithms). Performance for short vectors exhibits
similar fluctuations, ranging from a 6.9% increase in
attainable bandwidth for the daxpy and hydro benchmarks, to a
15.3% decrease for copy.
Again, there is no discernible pattern to the performance
variations, but now scale is the only benchmark whose
performance remains unchanged. For instance, Algorithm
R3’s performance on daxpy for 100-element vectors and a
four-bank memory using sixteen-deep FIFOs is 67.0% of
peak. R2 and R1 both deliver 69.3%, a difference of only a
few percent. On the copy benchmark on a two-bank system
with eight-deep FIFOs, however, the difference goes the
other way: R3 attains 68.3% of peak, whereas R2 and R1
deliver 66.4%. For FIFOs of sixteen double-words and the
same number of banks, R2 and R1 once again win out with
80.0% over 77.8%.
There seems to be little advantage in waiting for a certain
number of accesses to a DRAM page to accumulate before
paying the page-miss overhead. Although doing so
occasionally improves bandwidth, it also frequently diminishes
performance, and the drops seen are about twice as large as
the gains. Indeed, performance may suffer appreciably under
such a policy. This finding is advantageous from an
implementation standpoint, since incorporating the threshold
would require extra circuitry, and complicate the selection
logic.
Table 3 summarizes T3’s performance with respect to T1.
Blank entries indicate that differences, if any, are less than
0.01%. Numerical values are given for differences of
magnitude greater than 1%; entries of lesser magnitude are
represented by the sign of the difference.
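The convention used for the table entries can be written out as a small helper. The function name is hypothetical, and the sketch assumes the difference is expressed in percentage points of peak bandwidth and that the "greater than 1%" test is applied to the unrounded value.

```python
def table_entry(diff: float) -> str:
    """Format a bandwidth difference the way the tables present it."""
    if abs(diff) < 0.01:
        return ""                  # blank: difference below 0.01%
    if abs(diff) > 1.0:
        return f"{diff:+.1f}"      # signed value rounded to the nearest tenth
    return "+" if diff > 0 else "-"  # lesser magnitude: sign only
```

For example, a difference of -11.93 is shown as `-11.9`, a difference of 0.5 as `+`, and a difference of 0.004 as a blank entry.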
TABLE 3 
Performance of Scheme T3 with Respect to T1 
Differences in Attained Percentage of Peak Bandwidth 
medium vectors long vectors 
FIFO depth FIFO depth 
benchmark banks 8 16 32 64 128 8 16 32 64 128 256
COPY 1 + 
2 +2.0 -2.5 
4 
8 
2 +1.8 - -1.9 
4 
8 -1.3 
2 +5.0 + -3.2 + 
4 +1.5 -3.2 - 
8 -1.5 -3.2 
scale 1 
2 
4 
8 
daxPY 1 - 
hydro 1 
-11.9 - 
-11.5 + +  
-10.6 - +  
-7.5 
- - -  
+1.4 + + 
+1.7 + 
+1.7 
+O.O +3.8 +2.7 
+8.3 +4.4 +3.0 
- +6.0 
+ i  
t 
+1.9 +2.3 
+5.2 +4.4 
+10.8 +6.2 
+ +  
+ +  
+ 
I 
- 
I 
+1.5 
+8.8 
+ 
I 
TABLE 3-continued 
Performance of Scheme T3 with Respect to T1 
Differences in Attained Percentage of Peak Bandwidth 
medium vectors long vectors 
FIFO depth FIFO depth 
benchmark banks 8 16 32 64 128 8 16 32 64 128 256
swap 1 + +1.0 +1.0 -2.2 + 
2 + +1.6 +2.3 -1.6 +1.8 
4 + +1.2 -1.2 +3.8 
8 + +1.4 +6.0 
2 - -1.5 - 
4 -1.6 - 
8 +3.6 
vaxpy 1 -1.6 
+ + + -  
+ + + + + +  
+ + + -  
+ + + -  
- 
- _ _ _ _  
+ + + + + +  
+ + + + +  
- + - +  
Group 4-Algorithms P4, R4, and T4 
These algorithms simply look for accesses that hit the 
current page of the selected bank, and if they find none, they 
choose the next FIFO in sequence. Unlike the previous 
schemes, they do not try to choose the “best” FIFO to service 
in the event of a necessary page miss. 
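A minimal sketch of this simpler FIFO-selection rule follows. It assumes each FIFO is summarized by the DRAM page of its next ready access to the selected bank (or `None` when it has no ready access), and it takes "next FIFO in sequence" to mean the first FIFO after the last-serviced one that has a ready access; the function name and both of these modeling choices are assumptions.

```python
from typing import Optional, Sequence


def select_fifo_group4(next_pages: Sequence[Optional[int]],
                       current_page: int, last: int) -> Optional[int]:
    """Pick a FIFO for one bank: page hit if any, else next FIFO in sequence."""
    count = len(next_pages)
    order = [(last + k) % count for k in range(count)]
    # First pass: any ready access that hits the bank's current DRAM page.
    for i in order:
        if next_pages[i] is not None and next_pages[i] == current_page:
            return i
    # Second pass: no attempt to find a "best" FIFO, just the next ready one.
    for i in order[1:] + order[:1]:
        if next_pages[i] is not None:
            return i
    return None
```

Unlike the Group 1 through Group 3 rules, nothing here inspects how full or empty a FIFO is, which is what makes the scheme cheap to implement.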
Intuitively, it would seem that these “less intelligent” 
algorithms would not perform as well as their more sophis- 
ticated counterparts in Groups 1 through 3. This turns out not 
to be the case. As depicted in FIG. 40 through FIG. 48, 
performance of these algorithms rivals that of the corre- 
sponding members of Group 1 and Group 2. 
For long vectors, shown in FIG. 40, FIG. 43, and FIG. 46, 
percentages of peak bandwidth obtained by these algorithms 
are usually within a few tenths (plus or minus) of those 
obtained by the more sophisticated algorithms. For the 
hydro benchmark, these algorithms often beat the others by 
over 10% of the attainable bandwidth (up to 13.2%, in the 
case of depth-64 FIFOs on an eight-bank memory system). 
For medium vectors, depicted in FIG. 41, FIG. 50, and 
FIG. 53, performance is virtually identical to that for Group 
1 on most benchmarks (copy, daxpy, scale, and vaxpy). 
Hydro again benefits from this simpler FIFO-selection 
algorithm, although by a somewhat smaller margin than for 
long vectors. For eight- and sixteen-deep FIFOs, T4 delivers 
62.3% and 76.5% of peak bandwidth on a two-bank system, 
whereas T1 reaches only 57.4% and 72.3%-a difference of 
over 4% of peak in both cases. On an eight-bank memory 
using a FIFO depth of sixty-four, however, T4 delivers only 
65.2% of the attainable bandwidth, but T1 is able to deliver 
68.8%. T1 again beats T4 by a few percent on the swap 
benchmark for very shallow FIFOs on two- and eight-bank 
systems. Performance for the P and R schemes is similar: 
hydro performance of the Group 4 schemes is several 
percent better than that of the corresponding Group 1 
schemes in some cases, but swap performance tends to be a 
few percent worse in others. 
The short vector performance shown in FIG. 42, FIG. 45, 
and FIG. 48 is precisely the same as for Group 1, except for 
hydro. Here the Group 4 schemes deliver slightly over 5% 
less of peak bandwidth than the Group 1 schemes for very 
shallow FIFOs and a two-bank memory, and they exhibit 
smaller performance fluctuations for memory systems with 
a higher interleaving factor. This set of algorithms both 
performs well (for deeper FIFOs, performance is very com- 
petitive with that of the corresponding Group 1 schemes) 
and would be easier to implement than the others described 
thus far. The combination of bank-selection and FIFO- 
selection algorithms represented by T4 would be particularly 
straightforward. 
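To suggest why the T-style combination is straightforward, one bus cycle of token round-robin bank selection can be sketched as a single function that defers to any FIFO-selection rule. The function name and the `pick_fifo` callback are hypothetical; the sketch assumes the token advances by one bank per cycle from the last bank considered, whether or not an access is initiated.

```python
from typing import Callable, Optional, Sequence, Tuple


def token_round_robin_step(
    bank_busy: Sequence[bool],
    last_bank: int,
    pick_fifo: Callable[[int], Optional[int]],
) -> Tuple[int, Optional[int]]:
    """Consider only the next bank; return (bank, chosen FIFO or None)."""
    bank = (last_bank + 1) % len(bank_busy)
    if bank_busy[bank]:
        return bank, None        # bank busy: nothing is done this cycle
    return bank, pick_fifo(bank)  # may still be None if no access is ready
```

A caller would pass the returned bank back in as `last_bank` on the following cycle, so the only state carried between cycles is the position of the token.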
TABLE 4 
Performance of Scheme T4 with Respect to T1 
Differences in Attained Percentage of Peak Bandwidth 
medium vectors long vectors 
FIFO depth FIFO depth 
benchmark banks 8 16 32 64 128 8 16 32 64 128 256
COPY 1 
2 
4 
8 
daxPY 1 
2 
4 
8 
2 +4.9 +4.2 +1.4 
4 +4.8 +3.0 +2.1 
8 +2.2 +2.9 +2.4 -3.6 
scale 1 
2 
4 
8 
hydro 1 
- - 
+ + +  
+ + + + +  
- -  
_ _ _ _ _  
+7.1 +6.5 +6.3 +3.3 +3.0 
+7.1 +10.2 +7.2 +6.2 +8.5
+2.7 +9.2 +10.4 +13.2 +10.7 
+ + + + +  
+ + + + +  
+ 
+ 
- 
+ 
+ 
+ 
+2.2 
+5.3 
+8.6 
+ 
I 
TABLE 4-continued 
Performance of Scheme T4 with Respect to T1 
Differences in Attained Percentage of Peak Bandwidth 
medium vectors long vectors 
FIFO depth FIFO depth 
benchmark banks 8 16 32 64 128 8 16 32 64 128 256
swap 1 
2 
4 
8 -2.3 
2 
4 
8 
vaxpy 1 
_ _ _ _ _ _  
+ - + + + +  
+ + + + + +  
-3.8 
+ +  
+ + + + + +  
+ + + + + +  
_ _ _ _  
Group 5-Algorithms P5, R5, and T5 
Group 4 indicated that a simpler ordering algorithm may
yield better performance. The question arises, however, as
to how simple the scheme can be and still achieve high
bandwidth. To determine this, an ordering scheme that
doesn't even look for accesses that hit a bank's current
DRAM page was implemented. These algorithms merely
issue accesses for the current FIFO until no more ready
accesses remain, then they move on to the next FIFO in
round-robin order. FIG. 49 through FIG. 57 illustrate the
performance of this group of algorithms.
FIG. 49, FIG. 52, and FIG. 55 illustrate long vector SMC
performance. The curves for copy and scale are virtually
identical to those for Algorithm P1. On the hydro
benchmark, performance is identical to that of the Group 4
schemes. For daxpy, swap, and vaxpy using shallower
FIFOs, the performance for eight-bank memory systems is
worse than that for the Group 1 schemes: up to 17.1% of
attainable bandwidth less for T5 on swap using eight-deep
FIFOs and an eight-bank system, or a drop of 13% of peak
relative to T4's performance. When FIFO depth is scaled
with the interleaving factor, performance differences are
small.
Medium vector SMC performance is depicted in FIG. 50,
FIG. 53, and FIG. 56. These performance curves exhibit
trends similar to those for long vectors when compared with
the corresponding curves for Group 1. Daxpy and vaxpy fare
slightly worse for shallow FIFOs, and swap's performance
is slightly lower overall. Again, these algorithms achieve a
higher percentage of peak bandwidth on the hydro
benchmark, but performance drops slightly (3.6% of peak)
for depth-64 FIFOs and an eight-bank memory. For deep
FIFOs, performance for all benchmarks converges to that
achieved by the other selection algorithms.
Short vector performance is almost identical to that of the
corresponding algorithms in Group 4, except for slight
drops for eight banks and shallow FIFOs.
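The Group 5 selection rule can be sketched in a few lines. This is a minimal illustration rather than the disclosed circuitry: the `Fifo` class, its `pending` map from bank number to queued ready addresses, and the function name `select_t5` are all hypothetical. The point it shows is that the search stays on the current FIFO while that FIFO has a ready access for the bank, moves on in round-robin order otherwise, and never consults the bank's open DRAM page.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Fifo:
    """Hypothetical FIFO model: bank number -> queue of ready addresses."""
    pending: dict = field(default_factory=dict)

    def pop_ready(self, bank):
        """Remove and return the next ready address for `bank`, or None."""
        queue = self.pending.get(bank)
        return queue.popleft() if queue else None


def select_t5(fifos, bank, current):
    """Pick the next access for `bank`, starting at FIFO index `current`.

    No page-hit check is made: the first ready access found wins.
    Returns (address, fifo_index), or (None, current) if nothing is ready.
    """
    n = len(fifos)
    for step in range(n):
        idx = (current + step) % n       # round-robin from the current FIFO
        addr = fifos[idx].pop_ready(bank)
        if addr is not None:
            return addr, idx
    return None, current
```

Calling `select_t5` repeatedly with the returned index drains one FIFO's accesses for the bank before the next FIFO is considered.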
TABLE 5
Performance of Scheme T5 with Respect to T1
Differences in Attained Percentage of Peak Bandwidth
medium vectors long vectors
FIFO depth FIFO depth
benchmark banks 8 16 32 64 128 8 16 32 64 128 256
copy 1
2
4 
8 
2 +1.6 +1.2 
4 -2.9 - +2.4 
8 8.4 - - 
2 +4.9 +4.2 +1.4 
4 +4.8 +3.0 +2.1 
8 +2.2 +2.9 +2.4 -3.6 
scale 1 
2 
4 
8 
2 -3.4 -1.0 
4 -3.3 -3.0 
8 -12.1 -3.7 -2.5 
2 + +  
4 -2.0 - 
8 -2.4 -2.0 
daxPY 1 +1.2 + 
hydro 1 
swap 1 -  
vaxPY 1 + +  
- - 
+ + +  
+ + + + +  
- -  
+ - + + +  
_ _ _ _  
-12.8 + - - 
+7.1 +6.5 +5.3 +3.3 +3.0 
+7.1 +10.2 +7.2 +6.2 +6.5
+2.7 +9.2 +10.4 +13.2 +10.7 
_ _ _ _ _  
+ + 
+ + + t  
_ _ _ _ _  
-3.0 - - - + 
-4.2 -3.1 -1.1 - - 
-17.1 -4.2 -3.1 -1.1 - 
_ _ _  + 
+ + + + +  
-3.0 + - + + 
-4.0 -4.6 - -  
+ 
+ 
t 
+ 
+ 
t 
+ 
+2.2 
+5.3 
+8.8 
Group 6-Algorithms A1 and A2 
The algorithms discussed thus far generate memory
accesses by first choosing a bank (or banks) to access, and
then choosing the appropriate FIFO (or FIFOs) for which to
initiate accesses. The algorithms in Group 6 perform their
duties in the opposite order: first they choose a FIFO to
service, and then they choose the bank to access.
Algorithm A1 goes round-robin through the FIFOs,
initiating accesses for the current FIFO until it contains no
ready accesses. At that point, the SMC advances to the next
FIFO and proceeds to initiate accesses for it. While servicing
a particular FIFO, if the next ready access from that FIFO is
to a busy bank, the SMC waits until the bank is idle; it does
not try to find an access to a currently idle bank. Results for
this ordering scheme are depicted in FIG. 58 through FIG.
60.
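A FIFO-centric selection step of the kind described for Algorithm A1 might look like the following sketch. Several things here are assumptions made for illustration: each FIFO is modeled as a queue of ready addresses in stream order, banks are word-interleaved so that the bank number is `addr % num_banks`, and the name `select_a1` is hypothetical. The behavior to notice is that a busy bank causes a wait instead of a search for an access to an idle bank.

```python
from collections import deque


def select_a1(fifos, current, busy_banks, num_banks):
    """One A1-style selection step.

    fifos      -- list of deques of ready addresses (hypothetical model)
    current    -- index of the FIFO currently being serviced
    busy_banks -- set of bank numbers that cannot accept an access now
    num_banks  -- interleaving factor; bank = addr % num_banks (assumed)

    Returns (address, fifo_index); address is None when the SMC must wait.
    """
    n = len(fifos)
    for step in range(n):
        idx = (current + step) % n
        if fifos[idx]:                      # this FIFO has ready accesses
            addr = fifos[idx][0]
            if addr % num_banks in busy_banks:
                return None, idx            # wait; do not look elsewhere
            return fifos[idx].popleft(), idx
    return None, current                    # no FIFO has ready accesses
```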
Algorithm A2 is a slightly more sophisticated version of
A1, incorporating a threshold similar to that of the
algorithms in Group 3. If the SMC determines that the next
access from the current FIFO will generate a DRAM page
miss, it decides whether or not to switch to a different FIFO.
When it must issue an access that misses a bank's current
page, it attempts to choose the access from a FIFO that
contains ready accesses equal to at least half its depth. If the
current FIFO requires enough service, the access is issued
for it. Otherwise the SMC looks at the next FIFO in
sequence, and so on. If no FIFOs meet the threshold, the
algorithm issues no accesses at that time. Performance of
this algorithm is illustrated in FIGS. 9, 13 and 16. As
expected, simulation results for these algorithms exhibit the
same degradation in performance that was seen with many
of the other algorithms for shallow FIFOs on memory
systems with a high degree of concurrency.
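The threshold test in Algorithm A2 can be illustrated as follows. This is a sketch under stated assumptions: the caller already knows whether the current FIFO's next access is a page hit, the per-FIFO counts of ready accesses are supplied as a list, and the function name `choose_fifo_a2` is hypothetical. Only the decision made at a page miss is modeled.

```python
def choose_fifo_a2(ready_counts, current, depth, next_is_page_hit):
    """Return the index of the FIFO to service, or None to issue nothing.

    ready_counts     -- ready accesses held by each FIFO (assumed input)
    current          -- index of the FIFO being serviced
    depth            -- FIFO depth; the threshold is half of this
    next_is_page_hit -- True if the current FIFO's next access hits the
                        open DRAM page (assumed to be known by the caller)
    """
    if next_is_page_hit:
        return current                      # keep streaming page hits
    n = len(ready_counts)
    for step in range(n):
        idx = (current + step) % n          # current FIFO is checked first
        if 2 * ready_counts[idx] >= depth:  # at least half the depth ready
            return idx
    return None                             # no FIFO meets the threshold
```

With sixteen-deep FIFOs, for example, a page miss is only accepted on behalf of a FIFO holding eight or more ready accesses.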
For long vectors, performance tends to be lower than that
of Algorithm P1 for most benchmarks run with FIFOs up to
32 double-words deep. Hydro is the exception to this:
Algorithm A1 outperforms the Group 1 schemes for all
FIFO depths and memory systems. For deeper FIFOs, A1's
performance for all benchmarks is within a few percent of
that for the Group 4 algorithms, but for shallow FIFOs
(especially on a memory system with many banks), its
performance is as much as 16.9% of peak lower.
For medium vectors, performance again tends to be lower
than that of the Group 1 algorithms for FIFOs of depth eight,
sixteen, and thirty-two. When compared with Group 4, these
algorithms provide virtually identical performance for
deeper FIFOs, but performance is often over 10% of peak
worse for shallow FIFOs and higher interleavings.
Short vector performance is similar to that of Algorithm
P1, but A1 performed worse in a few instances. Most
benchmarks fare worse with eight-deep FIFOs, regardless of
the number of banks in the memory system. A1's
performance on the swap kernel on a two-bank system is
about 5% of peak below P1's.
On long vectors, A2 performs almost identically to A1.
On medium vectors, however, A2 fares significantly worse
for deeper FIFOs on the copy benchmark. Smaller drops in
performance are evident for the swap and hydro benchmarks
for FIFOs of sixty-four or more double-words. Medium
vector performance for the other benchmarks is about the
same as for A1, with performance generally dropping by less
than 2% of attainable bandwidth. On short vectors, the
bandwidth delivered by A2 on the copy benchmark is much
lower: a difference of almost 20% of peak for a single-module
system. A2 performs about the same as A1 on the scale
benchmark, and performance for the two algorithms is
similar for the daxpy, vaxpy, and swap kernels with FIFOs
at least sixteen deep. A2 consistently outperforms A1 for
very shallow FIFOs and 8-bank interleavings, and for swap
in general on all but the single-bank memory (but only by
one or two percent of peak, in the latter case). Neither of
these is a strong argument in favor of A2.
TABLE 6 
Performance of Scheme A1 with Respect to T1 
Differences in Attained Percentage of Peak Bandwidth 
medium vectors long vectors 
FIFO depth FIFO depth 
benchmark banks 8 16 32 64 128 8 16 32 64 128 256
copy 1
2 +  
4 +1.0 
8 -10.0 + 
2 -2.6 - +1.2 
4 -9.6 -2.6 
8 -12.6 -7.1 -4.1 
2 +2.4 +2.7 
4 -1.5 +4.4 
8 -2.8 -1.7 + -4.7 
2 -6.0 -2.7 - - 
4 -11.0 -4.0 - - 
8 -9.1 -0.5 +1.0 +1.0 
2 -5.5 -1.4 -1.5 - 
4 -0.1 -4.0 -1.6 - 
8 -10.0 -0.9 -2.5 - 
2 - - -1.0 - 
4 -5.2 -4.0 -1.1 - 
8 -5.1 -4.0 - - 
daxPY 1 +1.2 + 
hydro 1 
scale 1 
swap 1 -  
vaxPY 1 + +  
+ + + + +  
+ + + +  
-13.8 + 
-3.4 -2.2 -1.1 - - 
-11.5 -5.4 -2.1 -2.0 -1.0 
-10.1 -11.5 -6.0 -2.1 -2.1 
+4.6 +4.6 +4.2 +3.1 +2.6 
+9.3 +5.6 +5.0 +0.0
-2.0 +2.1 +8.0 +11.7 +9.6 
+ +  - - -  
- - - - -  
- -8.6 -3.2 -1.0 - - 
- -14.1 -0.0 -3.2 -1.5 - 
+1.9 -13.3 -14.2 -0.0 -3.2 -1.5 
- - - -  
-4.7 -2.1 -1.6 - - 
+ -6.7 -5.9 -1.0 -1.4 - 
-1.4 <20.7 -0.0 -5.9 -2.5 -1.6 
- -1.7 - - - - 
- -7.5 -4.4 -2.2 -1.1 - 
- -6.2 -10.4 -2.3 -2.7 -1.9 
- - -  + 
+ 
+ 
+ 
- 
- 
+ 
+2.0 
+5.0 
+8.2 
- 
- 
- 
+ 
+ 
- 
+ 
+ 
+ 
- 
TABLE 7
Performance of Scheme A2 with Respect to T1
Differences in Attained Percentage of Peak Bandwidth
medium vectors long vectors
FIFO depth FIFO depth
benchmark banks 8 16 32 64 128 8 16 32 64 128 256
copy 1
2 
4 
8 
2 
4 
8 
2 
4 
8 
scale 1 
2 
4 
8 
2 
4 
8 
2 
4 
8 
daxpy 1 
hydro 1 
swap 1 
vaxpy 1 
-11.9 - 
+ -2.5 -11.9 + + -  
+1.0 - -11.5 + + -  
-10.9 + -0.4 -13.8 
+1.2 + 
-2.6 - +1.2 -3.4 -2.2 -1.1 - 
-9.0 -2.6 -11.5 -5.4 -2.1 -2.0 
-11.5 -7.1 -4.1 -16.1 -11.5 -6.0 -2.1 
-4.9 +3.3 - - +4.6 +4.8 +4.2 +3.1 
-1.5 +4.0 -1.0 -2.4 - +0.3 +5.5 +5.0 
-1.7 -1.7 + -8.2 -2.0 -2.6 +2.1 +8.3 +11.7 
- -  
- - -  
-0.0 -2.7 - - - 
-11.0 -4.0 - - - 
-9.1 -0.5 +1.0 +1.0 +1.0 
- -3.5 - 
-5.3 -1.9 -2.2 -3.3 - 
-8.3 -4.9 -1.5 -2.0 - 
-14.7 -6.1 -3.6 - -1.8 
-2.8 - - - - 
-5.9 -3.4 - - - 
-4.4 -7.0 + - - 
- -  
+ +  
-0.0 
-14.1 
-13.3 
-5.6 
-0.7 
-19.9 
-2.2 
-7.5 
-8.2 
+ 
-3.1 
-0.5 
-14.2 
-2.2 
-5.9 
-8.6 
-1.0 
-2.2 
-10.4 
-1.6 - 
-3.1 -1.5 
-8.6 -3.2 
-1.8 - 
-1.9 -1.4 
-5.9 -2.6 
-1.4 - 
-2.2 -1.7 
-2.3 -2.7 
- -  
- -  
+ 
- 
- 
-2.1 
+2.0 
+6.0 
+10.6 
+ 
+ 
+ 
+ 
- 
+2.0 
+5.0 
+8.2 
-1.0 
- 
Performance of the different access ordering schemes
tends to be very similar. Herein is summarized the
performance of the remaining five FIFO-selection algorithms
(6 through 10) when paired with the T bank-selection scheme.
Tables 2 through 7 indicate relative performance of these
schemes as compared to Scheme T1.
Algorithm 6 is identical to Algorithm 1, except that the
search for the FIFO requiring the most service from the
current bank begins with the last FIFO accessed by any
bank. Performance of Algorithm 6 is summarized in Table 8.
Algorithm 7 is similar, except that when a page miss is
inevitable, it chooses the next access from the FIFO
requiring the most service from all banks, starting the search
with the last FIFO accessed by the current bank. Algorithm 8
is identical, except that the search for the FIFO requiring the
most service begins with the last FIFO accessed by any
bank. Performance for Algorithms 7 and 8 is summarized
in Tables 9 and 10 respectively.
Algorithm 10 resembles Algorithm 5 in that neither
explicitly tries to initiate accesses that hit the current DRAM
page. Algorithm 10 issues the next access it finds, and
considers the FIFOs in round-robin order beginning with the
last FIFO accessed by any bank. Algorithm 5 begins its
search with the last FIFO accessed by the current bank.
Algorithm 10's performance is summarized in Table 12.
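Several of these variants differ only in where the round-robin search starts: at the last FIFO accessed by the current bank, or at the last FIFO accessed by any bank. The sketch below isolates that one difference. The class `SearchStart`, the policy strings "local" and "global", and the helper `search_order` are hypothetical names chosen for the example.

```python
def search_order(num_fifos, start):
    """Round-robin visiting order over FIFO indices, beginning at `start`."""
    return [(start + i) % num_fifos for i in range(num_fifos)]


class SearchStart:
    """Tracks the last FIFO serviced, per bank and overall (hypothetical)."""

    def __init__(self, num_banks):
        self.last_by_bank = [0] * num_banks
        self.last_any = 0

    def record(self, bank, fifo):
        """Note that `bank` has just serviced an access from `fifo`."""
        self.last_by_bank[bank] = fifo
        self.last_any = fifo

    def start(self, bank, policy):
        """'local' starts at this bank's last FIFO; 'global' at any bank's."""
        if policy == "local":
            return self.last_by_bank[bank]
        return self.last_any
```

In this model a scheme such as Algorithm 5 would call `start(bank, "local")`, while Algorithm 10 would call `start(bank, "global")`; the rest of the search is unchanged.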
TABLE 8 
Performance of Scheme T6 with Respect to T1 
Differences in Attained Percentage of Peak Bandwidth 
medium vectors long vectors 
FIFO depth FIFO depth 
benchmark banks 8 16 32 64 128 8 16 32 64 128 256
copy 1
2 
4 
8 
2 
4 
8 
2 
4 
8 
daxPY 1 
hydro 1 
+ 
- 
+ 
TABLE 8-continued
Performance of Scheme T6 with Respect to T1
Differences in Attained Percentage of Peak Bandwidth
medium vectors long vectors
FIFO depth FIFO depth
benchmark banks 8 16 32 64 128 8 16 32 64 128 256
scale 1 
2 
4 
8 -8.3 -13.5 
2 +1.1 - + + - -  
4 - + + +  
8 + +  
2 + + +  
4 + + + + + -  
8 -1.2 - + 
swap 1 
vaxpy 1 
TABLE 9 
Performance of Scheme T7 with Respect to T1 
Differences in Attained Percentage of Peak Bandwidth 
medium vectors long vectors 
FIFO depth FIFO depth 
benchmark banks 8 16 32 64 128 8 16 32 64 128 256
copy 1
2 
4 
8 
2 +1.4 
4 -2.5 
8 -4.4 
2 +3.4 
4 -1.4 
8 +1.5 
scale 1 
2 
4 
8 
2 -  
4 
8 -6.6 
2 -2.0 
4 -6.5 
8 -1.7 
daxpy 1 
hydro 1 
swap 1 
vaxpy 1 
+ 
+ 
-3.3 
+4.2 
+1.2 
-1.9 
-1.5 
-7.7 
-5.9 
- 
+ 
- +1.4 
+1.8 -1.7 
-3.4 -6.8 
+2.3 +7.0 
-1.9 -1.3 
-3.6 +2.6 
+ 
+ 
+ 
-11.2 
-4.6 - - -1.1 
-4.9 - -6.6 
-2.3 + + -2.9 
+ 
+ 
+ 
+ 
-4.4 
+6.5 
+4.6 
+ 
+ 
+ 
- 
- 
+ 
-6.5 
-10.2 
+ 
+ 
+ 
- 
- 
+5.2 
+7.1 
+1.0 
+ 
+ 
+ 
- 
+ 
-3.9 
-5.8 
+ +  
+ +  
+ +  
+ +  
- 
+3.3 +2.8 
+5.9 +5.8 
+13.2 +7.5
+ +  
+ +  
+ +  
+ +  
- -  
- -  
-2.2 - 
+ 
+ 
+ 
+ 
+ 
+1.7 
+5.0 
+4.8 
+ 
+ 
+ 
+ 
- 
+ 
- 
-2.4 
TABLE 10 
Performance of Scheme T8 with Respect to T1 
Differences in Attained Percentage of Peak Bandwidth 
medium vectors long vectors 
FIFO depth FIFO depth 
benchmark banks 8 16 32 64 128 8 16 32 64 128 256
copy 1
2 - + + + + +  
TABLE 10-continued
Performance of Scheme T8 with Respect to T1
Differences in Attained Percentage of Peak Bandwidth
medium vectors long vectors
FIFO depth FIFO depth
benchmark banks 8 16 32 64 128 8 16 32 64 128 256
4 
8 
2 +1.4 + - 
4 -1.6 -1.6 
8 -9.1 -2.4 -3.3 
2 +4.1 +4.2 +2.3 
4 - +2.7 -3.8 
8 +1.1 -1.7 -4.3 -3.6 
scale 1 
2 
4 
8 -8.3 
2 
4 
8 -6.6 
2 -1.0 + -4.6 - - 
4 -7.4 -4.7 -5.3 
8 -1.6 -6.3 -4.1 -2.3 
daxpy 1 
hydro 1 
swap 1 
vaxpy 1 
+ + + + + +  
+1.4 + + + + + 
+ +1.0 - + - + 
-12.8 + - - + 
+7.3 +6.5 +5.2 +3.3 +2.8 +1.7 
- +3.6 +8.9 +5.9 +5.8 +5.2 
+2.6 + +1.0 +13.2 +10.3 +5.5 
+ + + + + +  
+ + + + + +  
-13.3 
+ - + + + +  
+ +1.1 + + + 
-10.0 + + +  
-1.1 + + - - + 
-6.6 -6.5 -3.9 - - - 
-2.9 -10.2 -5.6 -2.2 - 
-2.4 
Algorithm 9 resembles Algorithm 4, in that it tries to issue
accesses that hit the current DRAM page, but when it
cannot, it chooses the next access found. Algorithm 4 begins
its search for this access with the last FIFO accessed by the
current bank while Algorithm 9 begins with the last FIFO
accessed by any bank. Table 11 summarizes this algorithm's
performance.
TABLE 11 
Performance of Scheme T9 with Respect to T1 
Differences in Attained Percentage of Peak Bandwidth 
medium vectors long vectors 
FIFO depth FIFO depth 
benchmark banks 8 16 32 64 128 8 16 32 64 128 256
copy 1
2
4 
8 
daxPY 1 
2 
4 
8 +  -1.3 
2 +4.9 +4.2 +1.4 
4 +4.5 +3.0 +2.1 
8 + +2.9 +2.4 -3.6 
scale 1 
2 
4 
8 -6.3 
2 -  
4 +1.9 
8 -3.0 - 
2 -1.2 
4 
8 -1.4 -2.4 
hydro 1 
swap 1 
vaxPY 1 
+ 
- 
+ 
- 
+7.2 
+7.1 
+ 
+ 
+ 
-13.3 
- 
+ 
-0.0 
- 
+ 
+ 
+ 
- 
+ + + +  
+ + + +  
- - - -  
+6.5 +5.3 +3.3 +3.0 
+10.2 +7.2 +6.2 +6.5 
+9.2 +10.4 +13.2 +10.0 
+ + + +  
+ + + +  
+ + + +  
+ + + +  
+ +  
- - -  
+ + + +  
+ -1.4 - + 
- +  
+ 
+ 
- 
+ 
+ 
+ 
+2.2 
+5.3 
+0.8 
+ 
+ 
- 
+ 
+ 
+ 
+ 
I 
TABLE 12
Performance of Scheme T10 with Respect to T1
Differences in Attained Percentage of Peak Bandwidth
medium vectors long vectors
FIFO depth FIFO depth
benchmark banks 8 16 32 64 128 8 16 32 64 128 256
copy 1
2 
4 
8 
2 -1.6 + 
4 -4.6 -3.7 
8 -11.9 -4.2 
2 +4.0 +4.2 
4 -7.7 +3.7 
8 -2.3 -7.7 
scale 1 
2 
4 
8 -8.3 
2 -3.4 -1.0 
4 -10.8 -7.1 
8 -19.9 -10.6 
2 + +  
4 -9.0 -10.4 
8 -4.3 -14.8 
daxpy 1 +1.2 + 
hydro 1 
swap 1 -  
vaxpy 1 + +  
+1.2 
-1.1 
-7.2 
+1.4 
+2.1 
-2.8 
-0.2 
+ 
-3.0 
- 
-2.8 -1.2 
+ +  
- -  
-4.5 - 
-5.5 -2.9 
-4.7 -17.6 -4.4 
+5.8 +6.0 
-2.3 +6.5 
- -  
-10.7 -2.8 - 
+ +  
+ +  
-13.3 
-3.0 -1.7 
-8.7 -4.0 
-0.0 -10.1 -29.2 -10.2 
+ -5.4 -0.2 
+ -11.7 -5.0 
-3.9 + -6.2 -10.2 
- -  
- 
- 
+ 
+ 
- 
+ 
- 
-2.0 
+5.1 
+6.5 
+0.5 
- 
+ 
+ 
- 
- 
-1.5 
-15.6 
- 
+ 
-5.0 
- +  
+ +  
- +  
+ +  
- +  
- -  
- -  
+3.3 +2.90 
+5.9 +6.5 
+12.7 +10.4 
+ +  
+ +  
- -  
- +  
- -  
-1.4 - 
+ 
+ +  
- +  
- -  
+ 
+ 
+ 
+ 
+ 
0.2 
+2.2 
+5.3 
+8.7 
+ 
+ 
+ 
+ 
+ 
+ 
- 
+ 
+ 
+ 
+ 
Algorithms 1, 2, 3, 4, 5, and 7 thus use a local FIFO
priority, whereas the other algorithms use a global FIFO
priority. Likewise, schemes 1, 2, 3, and 6 use local
(subFIFO) status information to choose the next "best"
access. The others use global (FIFO) status information to
make this decision. Of these algorithms, only T9 and T10
represent viable alternatives to the schemes T1-T3 and T6.
Algorithm T6 offers no real advantage, as its performance is
almost identical to T1's. Algorithms T7 and T8 perform
inconsistently in comparison to T1, sometimes yielding
results several percentage points lower even for relatively
deep FIFOs. They both perform better for the hydro
benchmark and long vectors, but their performance on the
other benchmarks, and even hydro with shorter vectors, is
unpredictable and unimpressive. Algorithm T9, on the other
hand, only performs worse for very shallow FIFOs and
memory systems with many banks. In general, its
performance is competitive with the schemes from the
previous section, although Algorithm T4 tends to perform
slightly better in general for the utilized benchmarks.
Algorithm T10 performs much worse for shallow FIFOs and
high interleaving factors, but if it were sufficiently cheap to
implement, it might be a reasonable alternative, provided
deep FIFOs were also implemented.
FIG. 61 through FIG. 65 illustrate SMC performance for
long vectors (10,000 elements) as the memory's DRAM
page-miss to page-hit cost ratio increases. As before, all
performance curves are given as a percentage of peak
bandwidth; thus, for the systems with a miss/hit cost ratio of
sixteen, it is as if the page misses required sixteen times as
long to service. FIG. 61 through FIG. 63 may therefore
appear a bit misleading, since the miss/hit ratio is likely to
increase primarily as the result of a reduction of the page-hit
time, rather than an increase in the page-miss time. At a ratio
of sixteen, the SMC is delivering a somewhat smaller
percentage of a much larger available bandwidth, which
results in a significant net increase. To illustrate this, FIG. 65
shows the performance of hydro for long vectors if the
page-miss cost is held constant and the page-hit cost is
decreased, increasing the total bandwidth proportionately.
If the number of modules is held fixed and the page-miss/
page-hit cost ratio is increased, deeper FIFOs are required in
order to amortize the page-miss costs. Relative performance
is approximately constant if FIFO depth is scaled linearly
with miss/hit cost. The near-horizontal gray lines in FIG.
61(a), FIG. 61(c), and FIG. 61(e) highlight this effect.
Consider the hydro benchmark, for example. For an eight-
bank memory with a miss/hit cost ratio of sixteen, an SMC
with 256-deep FIFOs delivers 75.11% of peak bandwidth.
With FIFOs that are 128 deep, the SMC achieves a similar
performance (75.93%) with a miss/hit cost ratio of eight.
Likewise, when the miss/hit cost ratio is four and the FIFO
depth is halved again, the SMC delivers 77.43% of peak
bandwidth.
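The amortization argument can be made concrete with a deliberately idealized model that is not taken from this disclosure. Assume each visit to a FIFO costs one page miss followed by depth minus one page hits, and ignore bank concurrency, processor stalls, and startup effects. The fraction of peak bandwidth is then depth / (depth - 1 + ratio), where ratio is the miss/hit cost ratio. The function name `fraction_of_peak` is hypothetical, and the model's absolute values are not expected to match the simulated percentages quoted above; only the trend is of interest.

```python
def fraction_of_peak(depth, miss_hit_ratio):
    """Idealized fraction of peak bandwidth for one FIFO service burst.

    Assumes one page miss (cost = miss_hit_ratio, in units of a page hit)
    followed by depth - 1 page hits (cost 1 each). Peak bandwidth would be
    `depth` accesses in `depth` time units.
    """
    burst_time = miss_hit_ratio + (depth - 1)
    return depth / burst_time


# Depth scaled linearly with the miss/hit ratio, as in the hydro example
# (256 at ratio 16, 128 at ratio 8, and 64 at ratio 4 after halving again).
for depth, ratio in [(256, 16), (128, 8), (64, 4)]:
    print(depth, ratio, round(fraction_of_peak(depth, ratio), 3))
```

Under these assumptions the three configurations stay within about one percentage point of each other, which mirrors the near-constant relative performance described in the text, while a shallow FIFO at a high ratio falls off sharply.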
As the interleaving factor grows, so must the FIFO depth.
This is evident in the results of all benchmarks, including
scale, which nonetheless achieves near-optimal bandwidth
for all memory systems. Since this computation only
involves one vector, every access after the first hits the
current DRAM page. Performance is therefore independent
of the miss/hit cost ratio. For computations involving more
than one vector, shallow buffers limit the number of page
hits over which the SMC can amortize the cost of the
inevitable page misses. Scale doesn't suffer from this, but its
performance on the eight-bank memory system demonstrates
another problem: with shallow FIFOs, the SMC
cannot prefetch enough data to keep the processor from
stalling. This inability to adequately overlap memory access
with computation causes the benchmark to achieve over
20% less of the attainable bandwidth for eight- or sixteen-
word buffers than it does for deeper FIFOs. Even the faster
systems, those with a high interleaving factor or a high
miss/hit cost ratio, still require only modest amounts of
buffer storage.
The overwhelming similarity of the performance curves
presented in the foregoing illustrates that neither the
ordering strategy nor the processor's access pattern has a
large effect on the SMC's ability to optimize bandwidth. In
fact, the simpler algorithms usually do as well as or better
than their more sophisticated counterparts. For the benchmarks
and memory systems simulated, algorithms involving a
"threshold of service" requirement behave inconsistently,
and generally fail to outperform the simpler schemes.
Explicitly trying to take advantage of the memory system's
available concurrency by initiating accesses in parallel
(P) turns out to be of no real benefit, and occasionally
hinders performance. Given that the SMC can only process
one access at a time, it makes sense to initiate only one
access each bus cycle. Performance of the "greedy
round-robin" (R) scheme and the simpler "token passing"
(T) scheme is sufficiently similar that deciding which is
preferable becomes a question of implementation cost. The
additional complexity of implementing the former (R)
scheme seems an unjustifiable expense, as the latter (T)
scheme should prove simpler and indeed faster.
The choice of T4 or T5 over A1 depends on the complexity
of the circuitry required to implement each. The
bank-centric schemes, T4 and T5, give better overall
performance, but if A1 is sufficiently inexpensive to
implement, the cost/performance tradeoffs might be
worthwhile.
The foregoing illustrates that FIFO depth must scale with
the interleaving factor to achieve good performance on a
memory system with a large number of banks. Even the best
ordering algorithms will be stifled by inadequate buffer
space. When faced with a choice between implementing a
more complicated, and better-performing, access ordering
scheme and building deeper FIFOs, the latter will generally
yield better performance. Prefetching can be used in
conjunction with the SMC to help compensate for the latency
in FIFO references.
It has also been demonstrated that an SMC system causes
no additional delay in responding to normal memory access
requests, either scalar accesses or cache line accesses.
Additionally, applications not using the SMC will incur no
performance penalties.
An alternative to the dual controller implementation would
be to use an associative buffer memory; only the control for
accessing the memory would be necessary.
The instant disclosure is scalable, allowing for practical
reorderings for a broad range of scientific computations.
Thus concurrency can be expanded as needed on the
"memory side" of the SMC, or at least until the
performance of the SMC itself becomes the bottleneck. At
that point, SMCs can be replicated on the bus. The
information concerning future accesses can be broadcast and
interpreted by those SMCs that control memories that
contain data that will be accessed. There does not need to be
a limit to the aggregate usable bandwidth of the system other
than the bus itself, which presumably has been built to match
the CPU.
The current trend is to simplify the hardware while
increasing the complexity of the software, and therefore the
compiler is utilized to detect the ability to use the SMC. For
certain applications, however, it may be beneficial to
eliminate the compiler and incorporate the detection into the
SMC hardware. The hardware can be designed to read the
user program and use the SMC for all applications.
Since other modifications and changes varied to fit
particular operating requirements and environments will be
apparent to those skilled in the art, the invention is not
considered limited to the example chosen for the purposes of
disclosure, and covers all changes and modifications which
do not constitute departures from the true spirit and scope of
this invention.
What is claimed is:
1. A memory controller for accessing memory, said
memory controller comprising:
  at least one stream buffer, said at least one stream buffer
    being a FIFO and buffering data,
  control registers, said control registers receiving stream
    parameters from a data processor,
  a memory scheduling unit for decoupling memory
    requests from a processor to enable access of data
    elements in an order that increases effective bandwidth
    over a program's natural order and that reduces average
    memory latency for data patterns,
  wherein a data processor sends said stream parameters to
    said control registers identifying data streams to be
    accessed and said memory scheduling unit generates
    memory addresses of the data elements based on said
    stream parameters and accesses the data elements in a
    dynamically determined order.
2. A method of accessing data comprising:
  a data processor,
  memory for storing data for use by said processor,
  a memory controller, said memory controller having
    stream buffers, said stream buffers buffering data,
  control registers, said control registers receiving stream
    parameters from said processor,
  a memory scheduling unit for dynamically decoupling,
    reordering and issuing accesses of data elements within
    patterns of memory accesses from said memory,
  a compiler, said compiler identifying said patterns of
    memory accesses, based on a user program, and
    generating instructions to transmit said memory
    patterns to said memory controller,
  comprising the steps of:
  1. compiling user program code, consisting in part of
    a. recognizing stream memory access patterns,
    b. generating machine instructions to cause said data
      processor to dynamically determine stream
      parameters based on said stream access patterns,
    c. generating machine instructions to transmit said
      stream parameters to said memory controller,
  2. initiating execution of a compiled user program by:
    a. executing machine instructions causing said data
      processor to calculate said parameters,
    b. executing said machine instructions in accordance
      with step 1(c) causing transmission of said stream
      parameters to said memory controller,
    c. receiving said stream parameters at said control
      registers,
    d. reading stream data elements by:
      accessing said data elements within said memory
        in an order dynamically determined by said
        memory scheduling unit,
      placing said data elements in said buffer,
      holding said data elements until said data elements
        are requested by said processor in said
        program's natural order,
      transmitting said data elements to said processor
        upon request for said data elements by said
        processor,
    e. writing data elements by:
      receiving said data elements in said buffer from
        said processor in said program's natural order,
      holding said data elements in said buffer,
      transmitting said data elements to said memory in
        an order dynamically determined by said
        memory controller,
  wherein data elements are accessed in said
    memory in an order determined by said
    memory scheduling unit.
3. The method of claim 2 further comprising the steps of
using base address, stride, length and access mode as stream
parameters, said stream parameters being of any length and
stride representable within the capabilities of said data
processor.
4. The method of claim 2 further comprising the step of
said memory controller reading the data elements in said
patterns from said memory in an order determined
dynamically by said memory controller, said data elements
being buffered in said memory controller until said processor
reads said data elements in said program's natural order.
5. The method of claim 2 further comprising the step of
said memory controller buffering said data elements
transmitted from said processor in said program's natural
order until written to said memory in an order determined
dynamically by said memory controller.
6. The method of claim 2 wherein said step of activating
a compiled user program further comprises specifying the
reading or writing of said data elements to or from said
memory.
7. The method of claim 2 further comprising the step of
dynamically reordering said memory accesses to increase
effective bandwidth over said program's natural order and
reduce averaged memory latency for said data patterns.
* * * * *
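As an informal illustration of the stream parameters recited in claim 3 (base address, stride, length, access mode) and of the read path in which memory is accessed in a dynamically determined order while the processor receives elements in natural order, consider the sketch below. It forms no part of the claims. The class `StreamParams`, its field names, the mode strings, the dictionary used as memory, and the `fetch_order` argument are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamParams:
    """Hypothetical container for one stream's parameters."""
    base: int     # address of the first element
    stride: int   # distance between consecutive elements
    length: int   # number of elements in the stream
    mode: str     # "read" or "write" (assumed encoding)

    def addresses(self):
        """Addresses of the stream's elements in natural order."""
        return [self.base + i * self.stride for i in range(self.length)]


def read_stream(params, memory, fetch_order):
    """Fetch elements in `fetch_order`, deliver them in natural order.

    memory      -- dict mapping address -> value (stand-in for DRAM)
    fetch_order -- permutation of element indices chosen by a scheduler
    """
    addrs = params.addresses()
    buffer = {}
    for i in fetch_order:               # memory-side order (reordered)
        buffer[i] = memory[addrs[i]]
    return [buffer[i] for i in range(params.length)]  # processor-side order
```

Whatever permutation the scheduler picks, the processor-side sequence is unchanged, which is the decoupling the claims describe.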
