Maximizing resource utilization by slicing of superscalar architecture by Patil, Shruti Ravikant
UNLV Retrospective Theses & Dissertations 
1-1-2006 
Maximizing resource utilization by slicing of superscalar 
architecture 
Shruti Ravikant Patil 
University of Nevada, Las Vegas 
Repository Citation 
Patil, Shruti Ravikant, "Maximizing resource utilization by slicing of superscalar architecture" (2006). UNLV 
Retrospective Theses & Dissertations. 2023. 
https://digitalscholarship.unlv.edu/rtds/2023 
MAXIMIZING RESOURCE UTILIZATION BY SLICING 
OF SUPERSCALAR ARCHITECTURE
by
Shruti Ravikant Patil
Bachelor of Engineering
Veermata Jijabai Technological Institute
University of Mumbai
2004
A thesis submitted in partial fulfillment
of the requirements for the
Master of Science Degree in Engineering
Department of Electrical Engineering
Howard R. Hughes College of Engineering
Graduate College
University of Nevada, Las Vegas
August 2006
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
UMI Number: 1439974
UMI Microform 1439974 
Copyright 2007 by ProQuest Information and Learning Company. 
All rights reserved. This microform edition is protected against 
unauthorized copying under Title 17, United States Code.
ProQuest Information and Learning Company 
300 North Zeeb Road 
P.O. Box 1346 
Ann Arbor, MI 48106-1346
Thesis Approval
The Graduate College 
University of Nevada, Las Vegas
June 19, 2006
The Thesis prepared by
Shruti Patil
Entitled
"Maximizing Resource Utilization by Slicing of Superscalar Architecture"
is approved in partial fulfillment of the requirements for the degree of
Master of Science in Electrical Engineering
Examination Committee Member
Examination Committee Member
Graduate College Faculty Representative
Examination Committee Chair
Dean of the Graduate College
ABSTRACT
Maximizing Resource Utilization By Slicing  
Of Superscalar Architecture
by
Shruti Ravikant Patil
Dr. Venkatesan Muthukumar, Examination Committee Chair
Professor of Electrical and Computer Engineering
University of Nevada, Las Vegas
Superscalar architectural techniques increase instruction throughput from one
instruction per cycle to more than one instruction per cycle. Modern processors make
use of several processing resources to achieve this kind of throughput. Control units
perform various functions to minimize stalls and to ensure a continuous feed of
instructions to execution units. It is vital to ensure that instructions ready for execution
do not encounter a bottleneck in the execution stage.
This thesis work proposes a dynamic scheme to increase the efficiency of the execution
stage by a methodology called block slicing. Implementing this concept in a wide,
superscalar pipelined architecture introduces minimal additional hardware and delay in
the pipeline. The hardware required for the implementation of the proposed scheme is
designed and assessed in terms of cost and delay. Performance measures of speed-up,
throughput and efficiency have been evaluated for the resulting pipeline and analyzed.
TABLE OF CONTENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1 INTRODUCTION
1.1 History of Computing
1.2 Architectures & Classifications
1.3 Design & Evaluation of architecture
1.4 Motivation for the Research work
CHAPTER 2 PRIOR WORK
2.1 Classification of Computer Architectures
2.2 Prior research on special architectures
CHAPTER 3 SUPERSCALAR PIPELINED ARCHITECTURE
3.1 DLX Architecture
3.2 Generic Superscalar Pipeline
CHAPTER 4 CONCEPTS AND IMPLEMENTATION
4.1 Block Slicing
4.2 Sliced ALU Implementation
4.3 Architecture of integer execution units
4.4 Area analysis
4.5 Implementation of DLX Sliced Processor using VHDL
CHAPTER 5 RESULTS
5.1 Time of Execution and Speed-Up
5.2 Efficiency
5.3 Throughput
5.4 Power-Delay Product
CHAPTER 6 CONCLUSIONS AND FUTURE WORK
6.1 Conclusions
6.2 Future Work
REFERENCES
VITA
LIST OF FIGURES
Figure 1. Computer Generations in 50 years
Figure 2. The appearance of superscalar processors on a timeline
Figure 3. Evolution of Commercial Superscalar Processors
Figure 4. Pipeline stages in a DLX architecture
Figure 5. Generic Superscalar Pipeline Stages
Figure 6. Superscalar, Pipelined DLX Implementation in VHDL
Figure 7. Block diagram of integer unit in VHDL implementation of superscalar DLX
Figure 8. Dataflow diagram during simulation of VHDL implementation of proposed concepts
Figure 9. Processing stages for using a sliced ALU implementation
Figure 10. Block diagram of a sliced ALU
Figure 11. Steps of operation of a sliced ALU
Figure 12. Two interconnected 4-bit adder/subtracter units forming one 2-slice adder/subtracter
Figure 13. Architecture of flexible adder/subtracter unit
Figure 14. Block diagram of a flexible adder/subtracter unit
Figure 15. Block diagram of a comparator
Figure 16. Four 8-bit compare slices for signed or unsigned comparison
Figure 17. Multiplexer implemented using pass-transistor logic
Figure 18. Waveforms of simulations for ALUinstructions-Part1.out for DLX processor with (a) non-sliced ALU and (b) sliced ALU
LIST OF TABLES
Table 1 DLX instructions
Table 2 Average of MIPS dynamic instruction mix in SPECint2000 and SPECfp2000 benchmark suite
Table 3 Usage of ALU units in benchmarks
Table 4 Truth Table for decoder that directs the output of four slices into result register
Table 5 Additional hardware used for slicing of ALU
Table 6 Results of evaluation of Time of Execution and Speed-up
Table 7 Efficiency
Table 8 Throughput in terms of Instruction Per Fetch Cycle
Table 9 Power Delay Product for execution of two worst-case operations for two 16-bit worst-case operands
ACKNOWLEDGEMENTS
Graduate school has been a journey with classic experiences that have brought
about a great change in my research and my life. I am thankful for the time I spent at
UNLV.
I feel extremely fortunate to have had Dr. "Venki" Muthukumar as my advisor. He is
the man who has mentored me endlessly and helped me define and refine my research
ideas with an incredible amount of support and patience. As is apparent in the first few
interactions, his knowledge in a wide range of arenas brings him an all-round
personality topped by a witty sense of humor. The support and dedication he has for his
students is more than what any student can ask for. I am extremely grateful for the
research tools and the constant guidance that he has given me.
I am thankful for the wonderful laboratory environment sparked by my lab-mates,
Ashwini Raina, Naveen Chinthalcheruvu, Gopinath Balakrishnan and Shankar
Neelakrishnan. It is only in a conducive environment that out-of-box thinking is
cultured. Friendliness and the ability for hard work that has become a mark of all Lab
B348 inmates was rubbed off on me when I joined the research team. I am grateful for
my interactions with my friends, Amruta Tilaye, Rathna Ramaswamy, Pradeep
Nambisan and Navin Veermisti who have helped me at various points professionally
and personally.
Lastly, I would like to thank my parents, sister and my grandmother for being
supportive of my academic journey. They have been extremely understanding and a
constant source of encouragement and joy.
CHAPTER 1 
INTRODUCTION
As we move to the GSI (Giga-Scale Integration) era, the challenges presented to a
computer architect increase in constraints and complexity. Current demands of
technological advances have spurred an exceptional development in the way computers
are designed. Advances in computer architecture span concepts ranging from
out-of-order superscalar architectures, aggressive speculative techniques and
high-bandwidth caches to distributed processor architectures.
The next section briefly traces the development of architectures and research on
modern architectures with enhanced performance and capabilities. Among the factors
that designers continuously seek to improve in a machine are clock speed and
instruction throughput. This research work proposes a dynamic way to increase
instruction throughput by concentrating on the processing elements of an architecture
and adding flexibility so that the processing bottleneck is addressed.
1.1 History of Computing
During the late forties, computers were mostly developed as machines performing
logic and arithmetic operations using vacuum tubes. There was a need for a suitable
electronic hardware architecture that was much more efficient than existing machines
such as the ENIAC. This led John von Neumann to formulate a
machine controlled by a sequential program stored in electronic memory along with the
data, and it came to be known as EDVAC (Electronic Discrete Variable Automatic
Computer). With von Neumann's architecture as the base, new architectural concepts were
conceived by integrating software with the hardware. With the invention of
programming languages and operating systems, computing power has increased
from a few hundred to several billion instructions per second. Program sizes increased
from a few thousand to several million lines of code. Computers came to be seen
both as special-purpose machines executing a particular program and as universal
machines capable of simulating any special-purpose machine.
The Manchester University computer science group developed the ideas of indexed
modification of addresses and the memory hierarchy in 1949. Index registers
permitted the execution of loops without modifying instruction addresses, and the
memory hierarchy idea led to the development of caches and the virtual machine concept.
In 1951, Wilkes proposed microprogrammed control as a systematic way of
controlling the operation of computers. Stack architecture was proposed by Barton in
1958 as a tool for compiling and executing expressions. This resulted in machine
architectures reflecting the organization of the programming language. The late fifties
saw the development of multiprocessors with separate I/O processors and arithmetic
processors.
Vector processors provided efficient machine operations involving data structures.
The Cray-1, developed in 1973, is an example of a vector supercomputer. With the advent of
vector processors, pipelined architectures came into existence, which have since become
the backbone of subsequent architectures. A pipelined architecture obtains faster
operation by decomposing each operation into steps to be executed by cascaded
sub-units. Systolic array architecture evolved from pipelined architecture and is
characterized by identical processing elements connected in a linear or multi-dimensional
array, wherein each processing element is connected only to its adjacent elements.
In 1972, the increasing density of components on a chip using VLSI techniques
and the corresponding lower costs resulted in the implementation of a complete processor
on a single chip, known as a microprocessor. Further increases in component density
led to the evolution of microprocessors with complex instruction sets (CISC) and more
functionality in hardware. But this also resulted in a slowdown of processor speed.
Generation | Technology and Architecture | Software and Operating System | Example Systems
First (1946-1956) | Vacuum tubes and relay memory, single-bit CPU | Machine/assembly language, programs without subroutines | ENIAC, IBM 701, Princeton IAS
Second (1956-1967) | Discrete transistors, core memory, floating-point accelerator, I/O channels | Algol and Fortran with compilers, batch processing OS | IBM 7030, CDC 1604, Univac LARC
Third (1967-1978) | Integrated circuits, pipelined CPU, microprogrammed CU | C language, multiprogramming, timesharing OS | PDP-11, IBM 360/370, CDC 6600
Fourth (1978-1989) | VLSI microprocessors, multiprocessors, vector supercomputers | Symmetric multiprocessing, parallelizing compilers, message-passing libraries | IBM PC, VAX 9000, Cray X/MP
Fifth (1990-present) | ULSI circuits, scalable parallel computers, workstation clusters, Internet | Java, microkernels, multithreading, distributed OS, WWW | IBM SP2, SGI Origin 2000, Digital TruCluster
Figure 1. Computer Generations in 50 years [22]
To increase the speed of processing and reduce the number of instructions,
architectures with a reduced instruction set (RISC) were implemented with simple
circuitry. The invention of CISC and RISC architectures formed the baseline for a
burst of several new architectures, which led to the birth of a new generation known as
superscalar architectures. Superscalar processors can process several instructions in
the same instruction cycle, provided those instructions are independent of one
another rather than dependent on each other's results.
Figure 1 concisely shows the five generations of computers [22], which depict five
distinct development phases in the computer industry.
1.2 Architectures & Classifications
The parallel processing ability of superscalar architectures resulted in many
different architectures, and it is imperative that we classify the architectures into
various categories based on their features.
1.2.1 Classification based on Instruction Set complexity
■ Complex Instruction Set Computer (CISC)
■ Reduced Instruction Set Computer (RISC)
■ Minimal Instruction Set Computer (MISC)
■ High Level Instruction Set Computer (HISC)
■ Writable Instruction Set Computer (WISC)
■ Zero Instruction Set Computer (ZISC)
■ Very Long Instruction Word (VLIW)
1.2.2 Flynn's Taxonomy based on parallelism in instruction and data streams
■ Single Instruction Single Data stream (SISD)
■ Single Instruction Multiple Data stream (SIMD)
■ Multiple Instruction Single Data stream (MISD)
■ Multiple Instruction Multiple Data stream (MIMD)
  ■ Centralized Shared Memory
  ■ Distributed Memory
1.2.3 Classification based on internal storage of operands
■ Stack Architecture
■ Accumulator Architecture
■ Load-Store Architecture
■ Register-Memory Architecture
■ Memory-Memory Architecture
■ Extended Accumulator / Special Purpose Register Architecture
1.2.4 Classification based on application
■ General Purpose Architectures
■ Application Specific Architectures
1.2.5 Recent classification based on ability to exploit Instruction Level Parallelism
■ Scalar, Non-pipelined
■ Scalar, Pipelined
■ Superscalar, Non-pipelined
■ Superscalar, Pipelined
■ Superscalar, Superpipelined
1.3 Design & Evaluation of architecture
The essential elements of a processor are the datapaths, the instruction set and the
control unit. A datapath is either designed with general processing elements that process all
incoming tasks, or it is designed to handle specific tasks using specialized components.
The datapath determines the processing abilities of the architecture. An instruction set is
then required to be designed for the processor. An instruction generally consists of a
field to specify the operations to be performed and one or more fields to specify the data
to perform the operations on. The instructions may also be designed to provide control
information to the processor to execute the operations in an efficient manner. In this
case, there is another field, called the control field, that contains pre-determined control
bits. The control unit generates control signals that allow concurrent functioning of
different modules in the datapaths and enable the processor to output results timely and
correctly.
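The instruction layout described above (an operation field, operand fields and an optional control field) can be illustrated with a minimal sketch. The 32-bit word, the field names and the field widths below are assumptions chosen purely for illustration; they are not the DLX encoding or the format used later in this thesis.

```python
# Hypothetical 32-bit layout (an assumption, not taken from the thesis):
# a 6-bit opcode, three 5-bit register fields and an 11-bit control field
# holding pre-determined control bits. Fields are listed MSB first.
FIELDS = (("opcode", 6), ("rs1", 5), ("rs2", 5), ("rd", 5), ("control", 11))


def encode(**values):
    """Pack the given field values into one instruction word."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)  # unspecified fields default to zero
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode(word):
    """Unpack an instruction word into a dict of field values."""
    fields = {}
    shift = sum(width for _, width in FIELDS)
    for name, width in FIELDS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields
```

For example, `decode(encode(opcode=32, rs1=1, rs2=2, rd=3, control=5))` recovers the same five field values, which is the round trip an instruction decoder and an assembler perform on opposite sides.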
Various designs for architectures have been developed over the years. Most are
described in the classifications listed above. A new architectural design is generally
required for enhancing current performance, to impart newer capabilities to an existing
architecture or to exploit the latest circuit-level technology.
The impact that a newly designed architecture will have on tasks and programs
needs to be evaluated in order to successfully put it to use in practical applications.
Benchmark suites containing programs that represent a variety of application tasks
have been developed to assess the performance of architectures under different
environments. Certain desirable properties have been identified for performance metrics
to evaluate architectures. These are linearity, reliability, repeatability, ease of
measurement, consistency and measurement as described in [2]. The performance
metrics listed below have been used extensively for decades to reflect the performance
of new architectures:
■ Clock frequency
■ Millions of instructions executed per second (MIPS)
■ Millions of floating point instructions executed per second (MFLOPS)
■ Execution Time
■ Speed-up with respect to other systems
Once an architecture has been designed, it is analyzed for cost (in terms of gate
equivalents) and maximum clock frequency. Benchmark programs are run on the
machine, and their execution time gives an indication of the quality of the architecture
for the area of applications represented by the specific benchmark programs. These metrics
aid in comparing different architectures and facilitate the choice of an architecture for
an application at hand.
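As a brief illustration of how the rate and speed-up metrics listed above are commonly computed from measured instruction counts and execution times, consider the following sketch. The function names and the sample figures in the usage note are hypothetical and are not results from this work.

```python
def mips(instruction_count, execution_time_s):
    """Millions of instructions executed per second."""
    return instruction_count / (execution_time_s * 1e6)


def mflops(fp_operation_count, execution_time_s):
    """Millions of floating point operations executed per second."""
    return fp_operation_count / (execution_time_s * 1e6)


def speed_up(reference_time_s, new_time_s):
    """Speed-up of a new system relative to a reference system.

    A value greater than 1 means the new system finishes the same
    benchmark faster than the reference system.
    """
    return reference_time_s / new_time_s
```

With made-up numbers, a benchmark of 200 million instructions that finishes in 2 seconds gives `mips(200e6, 2.0)`, i.e. 100 MIPS, and a run that drops from 10 seconds to 8 seconds gives `speed_up(10.0, 8.0)`, i.e. 1.25.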
1.4 Motivation for the Research work
With advancements in VLSI design tools and fabrication techniques, the chip area
available to implement complex computer architectures has increased exponentially.
This increase in area can be used to accommodate a greater number of modules,
modules of increased complexity and functionality, or a combination of both. While
acknowledging the available latitude of chip area, this thesis explores ways of
increasing the efficiency of modules on the chip by introducing additional
functionalities to existing modules.
Most modern-day processors have a data width of 64 bits. It is possible to efficiently
use the processing elements to operate on data of smaller word sizes. A scheme called
block slicing is proposed in this work to increase instruction throughput when such
data is encountered. The scheme is applied to functional units to increase execution
parallelism in wide superscalar, pipelined architectures. This technique will be more
effective in general purpose machines and will lead to a higher processing rate without
increasing the number of processing units.
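The general idea of letting one wide functional unit serve several narrower operations can be sketched in software. The following is only an illustrative model under stated assumptions: a 64-bit adder divided into four 16-bit slices, with a hypothetical `chain_carry` flag deciding whether carries pass between slices. The slice width, the number of slices and all names are assumptions made for this sketch; the actual sliced ALU design is the subject of Chapter 4.

```python
SLICE_BITS = 16   # assumed slice width
NUM_SLICES = 4    # four slices form the assumed 64-bit datapath
SLICE_MASK = (1 << SLICE_BITS) - 1


def sliced_add(a, b, chain_carry):
    """Add two 64-bit words one slice at a time.

    chain_carry=True : carries ripple between slices (one 64-bit addition).
    chain_carry=False: carries are blocked at slice boundaries, so the four
                       slices perform four independent 16-bit additions.
    """
    result, carry = 0, 0
    for i in range(NUM_SLICES):
        shift = i * SLICE_BITS
        total = ((a >> shift) & SLICE_MASK) + ((b >> shift) & SLICE_MASK) + carry
        result |= (total & SLICE_MASK) << shift
        carry = (total >> SLICE_BITS) if chain_carry else 0
    return result


def pack(values):
    """Place up to four 16-bit operands into separate slices of one word."""
    word = 0
    for i, value in enumerate(values):
        word |= (value & SLICE_MASK) << (i * SLICE_BITS)
    return word


def unpack(word):
    """Split a 64-bit word back into its four 16-bit slice values."""
    return [(word >> (i * SLICE_BITS)) & SLICE_MASK for i in range(NUM_SLICES)]
```

In this model, `unpack(sliced_add(pack([1, 2, 3, 4]), pack([10, 20, 30, 40]), False))` yields the four independent sums in a single pass through the adder, whereas calling `sliced_add` with `chain_carry=True` behaves as an ordinary 64-bit addition. The point of the sketch is only that the same adder hardware can, in principle, retire several narrow operations at once without adding processing units.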
The scheme of block slicing, its design, implementation and evaluation are
elucidated in this thesis. This document is organized as follows. Chapter 2 presents a
literature review of computer architectures and their applications. Chapter 3 describes
the superscalar architectural technique for exploiting Instruction Level Parallelism (ILP)
and the DLX architecture, designed for academic purposes and described in [1]. Chapter
4 describes the concepts introduced by this thesis work and their design and
implementation as pipelined units. Chapter 5 evaluates the concepts and presents the
results of simulations. Chapter 6 presents conclusions from this work and proposes
future work that remains to be performed and evaluated.
CHAPTER 2 
PRIOR WORK
Computer architectures have advanced from the ENIAC (Electronic Numerical
Integrator And Computer) to present-day multi-core processors. Classification of
computer architectures is necessary not only to determine an optimal design for a
system, but also to systematize the sample space for architectural exploration and
progression. There are several categories under which an architecture can be classified.
Some of the conventional classification methods are based upon the complexity of the
instruction set, operand storage, application and instruction processing scheme. Other
criteria like cost, capacity, performance and component density have also been used in
the past to provide a basis for classification. Apart from these categories, new
classification methodologies based upon the number of storage hierarchy levels, the number of
addressable fields, the fault tolerance of the system and reconfigurability have lately been used to
compare the performance of upcoming architectures.
This chapter discusses conventional computer architecture classifications followed
by a description of different types of superscalar processors. Fourth generation
processors are also briefly explained in this section. A survey of existing literature is
presented at the end of the chapter.
2.1 Classification of Computer Architectures
Architectures can be categorized based on a number of broad classification criteria.
Most commercial processors fall under several of these criteria. It is possible to make a
narrower classification for advanced processors like parallel processors, distributed
processors and network processors.
2.1.1 Classification based on Instruction Set Complexity
1. Complex Instruction Set Computer (CISC):
The CISC instruction set combines several RISC-like operations in a single
instruction. This reduces the lines of code for a program and gives the designer the
ability to optimize multiple instructions in a single step. A reduction in instructions
leads to lower memory requirements and fewer memory accesses. However, the
functions of the instruction decoder stage intensify to a large extent. Typically, the
number of instructions in a CISC machine is 80-150. The main features of a CISC
system include register-to-memory and memory-to-register instructions, multiple
addressing modes for memory, a two-operand format, variable-length instructions and
many clock cycles per instruction. The CISC architecture is characterized by
complex instruction decode logic, a small number of general purpose registers and
several special purpose registers.
2. Reduced Instruction Set Computer (RISC):
The RISC architecture supports simple basic instructions that can be combined
to achieve complex tasks and are capable of running faster than CISC instructions. The
instruction decoder design is simplified due to the nature of the instructions, and hence
the control path design process is uncomplicated. The RISC architecture enables a
computer architect to exploit instruction parallelism and out-of-order execution.
RISC processors have a complex memory hierarchy in order to work at full speed and
allow for uninterrupted pipeline flow.
These processors are often classified based on various measures like the
datapath width, pipeline width, word size, cache structure, bus structure, type of
buffers and types of register files.
3. Minimal Instruction Set Computer (MISC) [4]:
The MISC architecture is made to exploit simplicity by using only 32
instructions. As the speed of RISC processors increases, a bottleneck is created
between the processor and the memory. A cache memory is necessary to buffer
instruction and data streams in order to increase the memory access speed. Cache
memory complicates the system design and makes the system more expensive. A RISC
processor is also very inefficient in handling subroutine calls and returns. A large
register window, big enough to handle input, output and local parameters, is used to
assist in subroutine calls. This large register window wastes the most valuable
resource in the RISC processor and slows the system during context switching.
MISC is implemented with only four instruction groups: transfer instructions,
memory instructions, arithmetic instructions and register instructions.
4. High Level Instruction Set Computer (HISC) [5]:
HISC is a 64 bit architecture. It involves simple instructions of fixed length,
entries of operand descriptors and application oriented data types. The operands of
an instruction are described by Operand Descriptors, which are records that consist
of virtual addresses, data types, operand sizes, vector information, operand access
codes and design and system dependent information for the operand. The data
types of the operands include integer, floating-point number, BCD, character and
string. The vector information includes the number of elements in the vector and the
element spacing for vector operands.
HISC reduces the demand for conditional branching found in RISC by eliminating
the looping count for operands of variable lengths and large size, as well as vectors.
On the other hand, HISC will operate super-scalar on a higher level. The
interdependency of operands will be much less while it is likely to operate
super-scalar for two or more function units. HISC also keeps the vector information so
that vector operations are done by hardware.
This is a general purpose architecture targeted on high performance,
implementation flexibility, expandability, better access control and system
dependent features. The HISC processor provides better encapsulation and is better
suited for multimedia applications.
5. Writable Instruction Set Computer (WISC):
Writable Instruction Set Computer is a stack based architecture whose design is
based on VLSI design methodology. These stack machines offer processor
complexity that is much lower than that of CISC machines and overall system
complexity that is lower than that of either RISC or CISC machines.
Earlier, stacks were placed in program memory chips. WISC maintains separate
memory chips or even on-chip memory for the stacks. This configuration provides
extremely fast subroutine calling capability and superior performance for interrupt
handling and task switching. WISC combines stack machine design with
opportunities offered by VLSI fabrication technology. This combination produces
simplicity and efficiency. Multiple stacks with hardware stack buffers, zero-operand
stack oriented instruction sets and the capability of fast procedure calls lead to
features like high performance without pipelining, simple logic and low system
complexity, small program size, fast program execution, low interrupt response time
and a low cost for context switching. A successful application area for WISC is real
time embedded control environments.
6. Zero Instruction Set Computer (ZISC) [6]:
ZISC is a neural network based integrated circuit which is designed for
applications using super computers. ZISC uses accumulated knowledge to recognize
and classify objects and take decisions. It learns by examples from samples of data.
The built-in learning mechanism accumulates knowledge during the training when
examples and their solutions are entered. ZISC has generalization capability which
gives the capability to react to objects which were not part of the learning examples.
ZISC’s learning capability is not limited in time and volume. Its chips can be
cascaded to create a larger system which ensures that the system architecture
caters to the increase in technology density. Several chips can be linked together to
build a wider network, without adding logic. These features make ZISC very easy to
use and capable of solving problems which are not clearly defined.
ZISC offers high performance, is capable of operating in real time and can be used
in pattern recognition and classification. It also has the ability to separate noise
from signal and this makes it a perfect platform for signal processing.
7. Very Long Instruction Word (VLIW):
Scheduling the instructions is the core problem in a modern processor design.
VLIW design provides an alternative by letting the software do all the scheduling.
The compiler examines the program, finds all the instructions with no
dependencies and strings them together in a very long batch, which is then executed
concurrently on an equally big array of function units such that all the function
units are used efficiently.
Very long instructions are typically between 256 and 1024 bits wide. These
instructions contain many smaller fields, each of which directly encodes an
operation for a particular function unit. The hardware involved is very simple,
consisting of a collection of function units, which include adders, multipliers and
branch units, connected by a bus, plus some registers and caches. More silicon
is used in actual processing and hence a VLIW processor runs fast, as the only limit is
the latency of the function unit. Due to their ability for scientific number crunching,
VLIW machines are widely used in scientific array processing and signal processing.
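The compiler-side scheduling described for VLIW can be illustrated with a minimal sketch. The instruction format (name, destination register, source registers), the function name pack_bundles and the greedy packing policy are all assumptions made for illustration; they are not taken from any particular VLIW compiler. The sketch considers only true (read-after-write) dependencies in straight-line code and assumes each register is written once.

```python
from typing import Dict, List, Tuple

# Hypothetical instruction format: (name, destination register, source registers)
Instr = Tuple[str, str, Tuple[str, ...]]


def pack_bundles(program: List[Instr], width: int) -> List[List[str]]:
    """Greedily pack instructions into long words of at most `width` operations.

    Each instruction goes into the earliest bundle that (a) comes after the
    bundles producing all of its source operands and (b) still has a free slot.
    Assumes every register is written at most once (no WAR/WAW hazards).
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    bundles: List[List[str]] = []
    ready_at: Dict[str, int] = {}  # register -> first bundle index that may read it
    for name, dest, srcs in program:
        # Registers never written by the program are treated as ready in bundle 0.
        slot = max((ready_at.get(s, 0) for s in srcs), default=0)
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1
        if slot == len(bundles):
            bundles.append([])
        bundles[slot].append(name)
        ready_at[dest] = slot + 1
    return bundles


example = [
    ("i1", "r1", ("a", "b")),
    ("i2", "r2", ("c", "d")),
    ("i3", "r3", ("r1", "r2")),  # depends on i1 and i2
    ("i4", "r4", ("e",)),
]
print(pack_bundles(example, width=2))  # [['i1', 'i2'], ['i3', 'i4']]
```

With two function units, the four instructions fit in two long words; with one unit the same program needs four, which is the sense in which a wide array of function units is kept busy by the compiler.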
2.1.2 Flynn’s Taxonomy
Flynn categorized all systems based on parallelism in the instruction and data
streams which are simultaneously active at the bottleneck component of the
multiprocessor system. All computers are placed into four different categories:
1. Single Instruction Single Data (SISD) Stream:
This is the class of conventional, sequential Von Neumann machines, in which
only one instruction consuming a restricted amount of data is allowed to execute at
a time. All state changes due to the instruction must be completed before the
execution of the next instruction begins. This category is the uniprocessor category.
2. Single Instruction Multiple Data (SIMD) Stream:
Multiple processors execute the same instruction using different data streams.
Each processor has its own data memory but there is a single instruction memory
and control processor, which fetches and dispatches instructions. Here, only one
instruction can be executed at a time but the state changes induced by the
instruction may be large. Parallelism is exploited by performing the same operation
concurrently on many pieces of data. Vector architectures belong to this class of
computers.
3. Multiple Instruction Single Data (MISD) Stream:
No commercial multiprocessor systems of this type exist to date. Some special
purpose stream processors use this architecture as there is only a single data
stream to be operated on by functional units.
4. Multiple Instruction Multiple Data (MIMD) Stream:
This class includes all parallel machines which contain multiple processors each
with its own program counter. Each processor fetches its own instructions and
operates on its own data. Different operations may be performed concurrently on
many pieces of data. The processors in the multiprocessor system are often taken
off-the-shelf.
Due to its flexibility and cost performance factors, the MIMD type of architecture
has clearly emerged as the most preferred architecture for general purpose
multiprocessor systems. MIMD multiprocessors are divided into two different
classes based on the number of processors used, the organization of the memory
and the interconnection strategy:
5. Centralized Shared Memory Architectures:
In this type of architecture, the typical processor count would be a few dozen. A
single centralized memory is connected to the processors using a single bus when
the processor count is small. By replacing the single bus with multiple buses, the
centralized memory can be scaled to handle a larger number of processors. These
multiprocessors are called Symmetric Multiprocessors (SMP) because the single
memory has a symmetric relationship to all the processors and a uniform access
time from any processor. This style of architecture is also known as Uniform Memory
Access (UMA).
6. Distributed Memory Multiprocessor Architecture [1]:
When the number of processors involved is large, a centralized memory system
would not be able to support the bandwidth demands of the processors without
incurring excessively long access latency. Hence, memory must be distributed
among the processors rather than being centralized. There are two major benefits of
having a distributed memory system. First, this model reduces the latency for
accesses to the local memory. Second, it proves to be a cost effective way of
scaling the memory bandwidth when most of the accesses are to the local memory.
2.1.3 Classification based on storage of operands [1]
The type of internal storage of the operands in a processor is the most basic
differentiation used for classifying the architectures. These are explained below:
1. Stack Architecture: All operands accessed by this type of architecture are stored
in a stack. An operation is performed by taking operands from the top of the stack.
2. Accumulator Architecture: This architecture implicitly accepts an operand stored
in a special register called an accumulator, and the second operand is stored in
a register. The result of an operation is also stored implicitly in the accumulator.
The advantage of this scheme is that the address of only one operand needs to be
specified while performing an operation.
3. Load-Store Architecture: In this class of computers, memory access is only
possible with load and store instructions.
4. Register-Memory Architecture: Here, memory access is possible as part of any
instruction.
5. Memory-Memory Architecture: This is a third class of architecture, not found
commercially. All operands are stored in and accessed from the memory itself.
6. Extended Accumulator/Special Purpose Register Architecture: There are more
registers present in this architecture than a single accumulator but restrictions are
placed on the use of these special registers. Such an architecture is known as an extended
accumulator or special purpose register architecture.
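The difference between the operand storage classes can be made concrete with a small sketch that computes C = A + B on three toy machines: stack, accumulator and load-store. The instruction mnemonics (push, pop, load, add, store), the tuple encoding and the interpreter functions are hypothetical and chosen only for illustration. The accumulator sketch assumes that the explicit operand is named by a memory address.

```python
def run_stack(program, memory):
    """Stack machine: operands come from, and results return to, the top of the stack."""
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(memory[args[0]])
        elif op == "add":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif op == "pop":
            memory[args[0]] = stack.pop()
    return memory


def run_accumulator(program, memory):
    """Accumulator machine: one operand and the result are implicitly the accumulator."""
    acc = 0
    for op, addr in program:
        if op == "load":
            acc = memory[addr]
        elif op == "add":
            acc += memory[addr]
        elif op == "store":
            memory[addr] = acc
    return memory


def run_load_store(program, memory):
    """Load-store machine: only load and store touch memory; add uses registers only."""
    regs = {}
    for op, *args in program:
        if op == "load":
            regs[args[0]] = memory[args[1]]
        elif op == "add":
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "store":
            memory[args[1]] = regs[args[0]]
    return memory


stack_prog = [("push", "A"), ("push", "B"), ("add",), ("pop", "C")]
acc_prog = [("load", "A"), ("add", "B"), ("store", "C")]
ls_prog = [("load", "R1", "A"), ("load", "R2", "B"),
           ("add", "R3", "R1", "R2"), ("store", "R3", "C")]

for run, prog in ((run_stack, stack_prog), (run_accumulator, acc_prog),
                  (run_load_store, ls_prog)):
    print(run.__name__, run(prog, {"A": 2, "B": 3})["C"])
```

All three programs leave the same value in C. They differ in how many operands each instruction must name: none for the stack add, one for the accumulator add and three for the load-store add.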
2.1.4 Classification based on application
Architectures can also be categorized based upon the applications they can
process: general-purpose, application-specific and parallel processors [7].
1. General Purpose Architectures
These kinds of architectures can perform a variety of tasks, and are the basis for
most Intel processors in a desktop machine. This is achieved by breaking down the
tasks to a generic instruction set which is supported by the architecture.
2. Application Specific Architectures
These architectures are targeted towards a specific application, or a family of
applications. Some application-specific architectures have been built for digital
signal processing, image processing and mixed signal processing. Every module in
such an architecture is implemented to perform at maximum efficiency and least
redundancy. The instruction set is also customized for the application.
2.1.5 Classification based on Instruction Level Parallelism
Instruction Level Parallelism (ILP) denotes a processor's ability to run many
instructions at the same time. Exploiting ILP has led to the evolution of superscalar
pipelined processors from a basic scalar processor. Today, ILP has become a major
factor around which processors are designed. Amdahl’s Law is generally used to
quantify a processor’s performance based on ILP. When evaluated using Amdahl’s
Law, superscalar processors show a large increase in performance.
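As a reminder of how Amdahl's Law is applied, the following sketch evaluates its usual textbook form: overall speedup = 1 / ((1 - f) + f / s), where f is the fraction of execution time that benefits from an enhancement and s is the speedup of that fraction. The function and parameter names are illustrative assumptions, and reading s as an issue width is only one possible interpretation.

```python
def amdahl_speedup(enhanced_fraction: float, enhancement_factor: float) -> float:
    """Overall speedup when `enhanced_fraction` of the time runs `enhancement_factor` times faster."""
    if not 0.0 <= enhanced_fraction <= 1.0:
        raise ValueError("enhanced_fraction must lie between 0 and 1")
    if enhancement_factor <= 0.0:
        raise ValueError("enhancement_factor must be positive")
    serial_part = 1.0 - enhanced_fraction
    return 1.0 / (serial_part + enhanced_fraction / enhancement_factor)


# If half of the execution time can be issued two instructions at a time,
# the overall gain is well below 2x because the other half is unchanged.
print(amdahl_speedup(0.5, 2.0))
```

The example shows why the portion of a program that cannot be parallelized bounds the benefit of a wider machine.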
The types of architectures based on their ability to exploit ILP are:
■ Scalar, Non-Pipelined
■ Scalar, Pipelined
■ Superscalar, Non-Pipelined
■ Superscalar, Pipelined
■ Superscalar, Superpipelined
1. Scalar, Non-Pipelined Architecture:
Architectures with a throughput of at most 1 Instruction per Clock Cycle (IPC) are
termed scalar architectures and these represent the simplest class of computers.
Architectures like Intel8085™ through Intel386™ were scalar and non-pipelined
architectures, with the lowest clock speeds among all the categories in this
classification.
2. Scalar, Pipelined Architecture
By interconnecting the different phases that an instruction undergoes from
the time it arrives in the processor till the time it leaves, it is possible to gain
processing speed. The phases are scheduled so that each phase proceeds in a
lock-step fashion, much like the assembly line processing in an automobile factory. Such
an architecture is called a pipelined architecture. In the ideal case, pipelining raises the
achievable clock speed, and with it the instruction throughput, by a factor of up to the
number of stages that are included in it. A typical pipeline, shown in
Figure 1, consists of six stages: fetch, decode, read registers, execute, writeback,
write to memory.
3. Superscalar, Non-Pipelined Architecture
An architecture that is capable of processing more than one instruction per
cycle is called superscalar. Such an IPC is obtained by having more than one copy
of the processing elements. There is no machine that is superscalar and non-pipelined.
This category only exists for the sake of completeness.
Figure 2 shows the time periods within which superscalar architectures were
designed.
4. Superscalar, Pipelined Architecture
Superscalar architectures were built with a view to extract parallelism from data
and instructions. Multiple instructions and data are fetched simultaneously and
out-of-order execution is enabled to reduce stalls. Additional hardware and stages,
like reservation units and a reorder buffer, are necessary to process an instruction in a
superscalar pipeline.
[Figure 2: timeline chart, 1987 to 1996, with separate panels for RISC processors and CISC processors]
Figure 2. The appearance of superscalar processors on a timeline [23]
5. Superscalar, Superpipelined Architectures
If the internal stages in a pipeline are themselves pipelined, the architecture is
called superpipelined. This facilitates the use of faster clocks and mechanisms to
avoid a kind of WAR stall.
Figure 3 shows the evolution of commercial superscalar processors [23].
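The gain from pipelining described for the scalar pipelined architecture can be estimated with a short sketch. It assumes an ideal pipeline in which every stage takes one cycle and no stalls occur. The function names are illustrative, and the six-stage depth is taken from the typical pipeline mentioned earlier.

```python
def cycles_non_pipelined(n_instructions: int, n_stages: int) -> int:
    """Each instruction passes through all stages before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions: int, n_stages: int) -> int:
    """The first instruction fills the pipeline; each later one completes one cycle after the previous."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


def ideal_speedup(n_instructions: int, n_stages: int) -> float:
    """Ratio of non-pipelined to pipelined cycle counts for an ideal, stall-free pipeline."""
    if n_instructions <= 0 or n_stages <= 0:
        raise ValueError("instruction and stage counts must be positive")
    return (cycles_non_pipelined(n_instructions, n_stages)
            / cycles_pipelined(n_instructions, n_stages))


print(cycles_non_pipelined(100, 6))  # 600
print(cycles_pipelined(100, 6))      # 105
```

As the number of instructions grows, the ratio approaches the number of stages. Real pipelines fall short of this bound because of stalls and unequal stage delays.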
2.1.6 Fourth Generation Processors:
Improvements in processor performance are achieved by two means:
> Advances in semiconductor technology
> Advances in processor microarchitecture.
To sustain the historic rate of increase in computing power, it is important for
improvements to occur in both ways mentioned. It is certain that clock frequencies will
continue to increase. The main architectural challenge is to issue many instructions per
cycle and to do so efficiently. Five next generation processors are described in this
section, which exploit the ILP in a system together with speculation.
[Figure 3: timeline chart, 1989 to 1996, of superscalar processor families by vendor (Intel, IBM, PowerPC Alliance, Motorola, DEC, HP, Sun/Hal, TRON, MIPS, AMD, Cyrix, NexGen, Astronautics Corp)]
Figure 3. Evolution of Commercial Superscalar Processors
1. Superspeculative Processors [8]:
These are wide issue superscalar processors that can issue up to 32
instructions per cycle. The inability to go beyond the data flow limit restricts the
complete exploitation of Instruction Level Parallelism. Superspeculative processors
overcome the dataflow limit problem by aggressively speculating past true
dependencies and exploring additional instruction parallelism.
The core basis for the superspeculative processors is that the producer
instructions generate highly predictable data in real programs. By successfully
speculating on the source operand values, the consumer instructions can start
execution without waiting for the result of the producer instructions. Thus, a
superspeculative processor removes the serialization constraints between the
producer and consumer instructions, thereby pushing its performance
beyond the classical data flow limit without sacrificing code compatibility.
2. Trace Processors [9]:
‘Traces’ are dynamic instruction sequences constructed and cached by
hardware. Traces are built as the program executes and are stored in a cache. Trace
processor systems work by breaking down the system into several processing
elements (PE) and the program into several traces so that the current trace is
executed on one PE while the future traces are speculatively executed on other PEs.
Each processing element has enough instruction buffer space to hold an entire
trace, multiple dedicated functional units, a dedicated local register file for holding
the local values and a copy of the global register file. Instruction fetch hardware
segments the program into traces, each of which may have 8 to 32 instructions as
well as embedded predicted conditional branches. The traces are placed in a trace
cache and a trace fetch unit subsequently reads the traces from the trace cache and
sends them out to the parallel processing elements. Hence, the trace becomes the
basic execution unit throughout the processor. Two major advantages of the trace
processors are:
■ The physical registers are divided into local and global registers. This
hierarchical organization allows for smaller register files which have fast
access times and fewer ports per file.
■ Successful value prediction of the trace’s data allows the trace to be
executed immediately and in parallel with other traces.
3. Multiscalar Processors [10]:
Multiscalar processors divide a program into different tasks that are distributed
to a number of parallel processing elements (PEs) which are controlled by a single
hardware sequencer.
A program is divided into a collection of tasks by using software and hardware.
These tasks are then distributed to the parallel processing elements. Each PE
fetches and executes the instructions assigned to it. The appearance of a single local
register file is maintained with a copy in each PE. Compiler generated masks enable
the dynamic routing of the register results to different processing units. Memory
accesses occur speculatively and the addresses are decoded dynamically. The only
wait involved in this system is caused by the true data dependencies.
4. Datascalar Processors [12]:
The Datascalar model of execution runs the same sequential program
redundantly across multiple processors. The data set is distributed across physical
memories that are tightly coupled to their distinct processors. Each processor
broadcasts operands that it loads from its local memory to all other processors.
Instead of explicitly accessing a remote memory, processors wait until the requested
value is broadcast. Stores are completed only by the processor that owns the
operand, and are dropped by the others.
This architecture exploits the fact that all memory is local to some processor in a
multiprocessor system. Thus each read operand can be fetched by some processor
and each memory update can be achieved by means of a write by some processor. A
major advantage of the datascalar architecture will be its ability to exploit
parallelism in codes that were not traditionally thought of as eligible for parallel
processing. The Datascalar model is a way of optimizing the memory and is not intended
to be a substitute for parallel processing.
5. Advanced Superscalar Processors:
These are wide issue superscalar processors that can issue up to 32
instructions per cycle.
An important feature of this architecture is its large trace cache and a large
number of reservation stations to accommodate 2000 instructions. There are 24 to
48 highly optimized, functional units. Aggressive speculation is performed to predict
the branches.
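The value speculation that superspeculative and trace processors rely on can be illustrated with a last-value predictor, one of the simplest schemes. The source does not say which predictor these processors use, so the scheme, the class name and the trace format of (instruction address, produced value) pairs are assumptions for illustration only.

```python
class LastValuePredictor:
    """Predicts that a producer instruction will produce the same value as last time."""

    def __init__(self):
        self._last = {}  # instruction address -> most recent value produced

    def predict(self, pc):
        """Return the predicted value, or None if this instruction has no history."""
        return self._last.get(pc)

    def update(self, pc, actual):
        """Record the value actually produced once the instruction completes."""
        self._last[pc] = actual


def prediction_accuracy(trace):
    """Fraction of (pc, value) entries whose value was predicted correctly."""
    predictor = LastValuePredictor()
    correct = 0
    for pc, value in trace:
        if predictor.predict(pc) == value:
            correct += 1  # a consumer could have started early with this value
        predictor.update(pc, value)
    return correct / len(trace) if trace else 0.0


# Hypothetical trace: the instruction at 0x10 keeps producing 5,
# while the one at 0x20 produces a changing value.
trace = [(0x10, 5), (0x10, 5), (0x10, 5), (0x20, 1), (0x20, 2)]
print(prediction_accuracy(trace))
```

Every correct prediction is a case in which a consumer instruction could begin without waiting for its producer. A misprediction would require the speculative work to be discarded and redone.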
2.2 Prior research on special architectures
2.2.1. Billion Transistor Architectures [12,13]:
Doug Burger and Goodman speculate and explore the future trends in computer
architecture. They extrapolate the scope of having one billion transistors on a single
chip. Important trends that would take place over the course of the next 10 years are
discussed in this paper.
A one billion transistor chip would require hardware manipulation but the physical
limits like on chip signaling, wire delays and global clock would be serious constraints.
They expect a quantum leap in the compiler’s ability to extract parallelism, thereby shifting
some of the parallelism from the hardware to the software. Considering the growing
costs involved in design, verification and testing, the authors conclude that
architectures that simplify the interaction among on-chip components and/or reduce
the number of interacting components will have a greater advantage over architectures
that do not.
2.2.2. One Billion Transistors, One Uniprocessor, One Chip [14]:
Patt et al. propose that when systems with one billion transistors are available,
computing systems with the highest performance will have a single processor on each
processor chip. They identify an architecture that will have the highest performance by
utilizing the maximum available instruction bandwidth. The hardware will consist of the
following components:
■ A large trace cache
■ A large number of reservation stations
■ A large number of pipelined functional units
■ Sufficient on-chip data cache
■ Sufficient resolution and forwarding logic
These components are necessary for aggressive speculation using an aggressive branch
predictor and for very wide issue superscalar processing.
The highest performance computing system will be a multiprocessor consisting of
powerful single chip uniprocessors. These will issue and execute 16 or 32 instructions
per cycle with nearly 100 percent branch prediction accuracy.
2.2.3. Dynamic Instruction Set Computer [15]
Michael Wirthlin and Brad Hutchings describe a new computer architecture that
can support dynamic modification of its instruction set based on the demand of the
incoming instruction. They present an implementation of a DISC architecture based on
three techniques:
■ Partial FPGA reconfiguration - Partial reconfiguration provides the ability to
reconfigure a sub section of an FPGA while the remaining logic operates unaffected.
Instructions occupy FPGAs only when needed while FPGA resources can be
reused to implement an arbitrary number of performance enhancing application
specific instructions.
■ Relocatable hardware - Relocatable hardware gives the flexibility to relocate or
make placement decisions of partial configurations at run time. This feature is
used in DISC to enhance run time hardware utilization. Relocating hardware
works on a strictly defined global context. Every instruction module is
configured on to the FPGA in such a way that each module is as close as possible to
the others in order to avoid wasted hardware between modules. A global context
provides physical placement positions and a communication network necessary
for these modules to operate correctly.
■ Linear Hardware Model - DISC implements relocatable hardware in the form of a
linear hardware model. The two dimensional grid of configurable logic cells is
organized as an array of rows. Each module’s location is specified by the vertical
and horizontal location while the size of the module is given by the module
height.
DISC is an example of an application specific processor with large instruction sets that
can be implemented on partially reconfigurable FPGAs.
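The linear hardware model and run-time relocation described above can be sketched as a row allocator. This is a simplified illustration under stated assumptions, not the DISC implementation. It assumes that every instruction module spans the full width of the array, so only its starting row and height matter. The first-fit placement policy, the class name LinearFPGA and the module names are hypothetical.

```python
class LinearFPGA:
    """Toy model of an FPGA treated as a linear array of rows."""

    def __init__(self, n_rows: int):
        self.rows = [None] * n_rows  # each entry holds the owning module name, or None

    def place(self, module: str, height: int):
        """Place `module` in the first run of `height` free rows.

        Returns the starting row, or None if no large enough gap exists.
        """
        if height < 1:
            raise ValueError("height must be at least 1")
        run = 0
        for i, owner in enumerate(self.rows):
            run = run + 1 if owner is None else 0
            if run == height:
                start = i - height + 1
                for r in range(start, i + 1):
                    self.rows[r] = module
                return start
        return None

    def remove(self, module: str) -> None:
        """Free the rows of `module` so that another instruction can reuse them."""
        self.rows = [None if owner == module else owner for owner in self.rows]


fpga = LinearFPGA(8)
print(fpga.place("add", 3))    # 0
print(fpga.place("mul", 4))    # 3
print(fpga.place("shift", 2))  # None: only one free row remains
fpga.remove("add")
print(fpga.place("shift", 2))  # 0: the rows released by "add" are reused
```

The last two calls show the reuse of FPGA resources at run time: an instruction module that is no longer needed is removed, and the rows it occupied become available to a different module.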
2.2.4. VISA: A Variable Instruction Set Architecture [16]
The author of this paper describes an instruction coding technique that reduces the
width of the instructions using dynamic instruction coding managed by the compiler. A
RISC processor is constrained by the instruction width to keep it within the limits
imposed by silicon. In this case, the compiler defines the set of instructions required in
order to execute a given program and selects the hardware functions that can be
activated during the same machine cycle by using an instruction.
The compiler divides and determines the instruction set based on two factors:
■ Functions to be activated
■ Number of bits needed by such functions.
The author presents a new VISA based microprocessor named VISP which delivers high
performance using the variable instruction set architecture for general as well as
floating point calculations. The result of the new architecture is more compact code and
a notable increase in the optimization capabilities of the compiler.
2.2.5. Application Specific Instruction Set Processor (ASIP) [17]
Chandra Shekhar et al. compare software based general purpose architectures to
dedicated hardware architectures and identify how the benefits of both are realized
through ASIP architectures. Dedicated hardware architectures can be combinational,
sequential, pipelined or parallel, or can be a mix of any of these, but a change in
functional specification necessitates a change in the architecture. These are closely tied
to the logic specification of the specific application and hence are very inflexible in their
functionality.
On the other side are the general purpose architectures which can implement any
logical function without requiring any change in the hardware. This flexibility comes
from the use of a rich instruction set. The CPU hardware is designed only to execute
any instruction from the instruction set loaded into its instruction register and then
proceed to load and execute the next instruction from the memory into the instruction
register of the CPU. Whenever there is a change in specification, only the sequence of
instructions stored in the memory changes, which is why it is called software based
architecture.
The inclusion of complex instructions in the instruction set of the processor in
addition to the necessary general purpose instructions makes the instruction set and
the processor application specific.
A uthors propose th a t  the  ASIP hardw are  arch itec tu re  would con ta in  a  n u m b er of 
application specific functional blocks an d  the  necessary  bussing  to move the  da ta . This 
reduces the  m em ory accesses an d  the  d a ta  tran sfe rs  am ong the  hardw are  blocks. A 
reduced nu m b er of b u sses  in the  CPU reduces the  am o u n t of b u s  interface logic in  the  
functional blocks in  the  CPU an d  control logic in the  control p a rt  of the  processor. ASIP 
processors will ru n  m ultiple overlapped executions of operations in  different functional 
u n its  to achieve m axim um  possible concurrency. Pipelining occurs a t the  functional 
block level. Application specific in stru c tio n  se ts  and  p rocessors are su itab le  for 
em bedded applications as they  perm it a n  a ltera tion  of hardw are-softw are b oundary  to 
m eet the  speed and  energy co n stra in ts  of a  specific application.
2.2.6. Application Specific Architectures [18]
Chris Weaver et al. proposed that the potential of application specific
architectures can be harnessed by specializing a design to a small domain of important
applications. The benefits of this approach would be improved performance, greater
power efficiency and reduced costs. Key differences between general purpose
and application specific architectures are discussed in this paper.
Producing dedicated hardware for an algorithm improves performance
drastically and reduces silicon area costs. This is obtained by eliminating all
aspects of the design that are not necessary for the algorithm. On the other hand, the
main drawbacks of an application specific architecture are increased marginal design
costs due to lower production volumes and reduced design flexibility, as the hardware
implementation cannot be changed after it has been manufactured. The main barriers
to entry for application specific architectures are:
1. Identifying the scope of their application domains. This requires analysis of
performance, power efficiency and economies of scale.
2. Reducing the design problems inherent in application specific architectures.
In support of their arguments, the authors present a detailed case analysis of ‘The
CryptoManiac Processor’. The architecture and its application specific optimizations are
explained.
2.2.7. An FPGA based Forth Microprocessor [19]
Applications which use an application specific FPGA along with a microprocessor have
two distinct advantages:
1. Reduced power consumption
2. Reduced system costs, by incorporating the microprocessor in the FPGA.
In this paper, a 16-bit FPGA-based microprocessor called MSL16 is described which
executes the ‘Forth’ programming language. It is based on a stack architecture with
each instruction occupying 4 bits, leading to a small instruction set, a simple datapath
and control, and high code density.
MSL16 consists of a 16-deep data stack (DS) for temporary variables and subroutine
parameters, and a T register that holds the top element of the stack so that the top two
elements of the stack are available to the ALU simultaneously. It also contains a
16-deep return stack (RS) to store subroutine return addresses, an instruction register which
holds the 4-bit instructions to be executed, a PC and an IR which store the address of
the next instruction, and finally an ALU which takes operands from T and the top
element of either DS or RS and returns the result to T.
Forth machines are suitable for embedding in FPGA applications because of their good
code density, easy customization, easy-to-handle development tools, high performance
and small area.
2.2.8. Flexible Instruction Processors [20]
The authors introduce a Flexible Instruction Processor (FIP) for systematic
customization of instruction processor design and implementation. General purpose
processors lose performance when dealing with custom operations and non-standard
data. Customizing the processor is required in such cases. This can be done either by
augmenting the processor with programmable logic for implementing custom
instructions or by implementing the instructions using FPGAs. Application specific
instruction processors provide another method of producing custom processors.
The unique features of FIP include:
■ A modular framework based on processor templates that capture various
instruction processor styles, such as stack-based or register-based styles.
■ Enhancements of this framework to improve functionality and performance,
such as hybrid processor templates and superscalar operation.
■ Compilation strategies involving standard compilers and FIP specific compilers,
and the associated design flow.
■ Technology independent and technology specific optimizations, such as
techniques for efficient resource sharing in FPGA implementations.
FIPs are assembled from a processor template with modules connected together by
communicating channels. The template can be used to produce different styles of
processors, such as stack-based and register-based. The parameters of a template are
selected to transform a skeletal processor into a processor suited for its task. Possible
parameterizations include addition of custom instructions, removal of unnecessary
resources, customization of data and instruction widths, optimization of op-code
assignments, and varying the degree of pipelining.
When a FIP is assembled, required instructions are included from a library that
contains implementations of these instructions in various styles. Depending on which
instructions are included, resources such as stacks and different decode units are
instantiated. Channels provide a mechanism for dependencies between instructions
and resources to be mitigated. This FIP framework has been implemented in Handel-C.
2.2.9. Power efficient flexible processor architectures for embedded applications [21]
A novel processor architecture is proposed in this paper which provides the
flexibility needed in practice at a reduced power and performance cost. A novel protocol
which combines an efficient, customized component with a flexible processor into a
hybrid architecture is proposed.
Based on the required flexibility, the target technology and processor architecture are
selected independent of their reuse considerations. Components benefiting from a
custom hardware implementation are still implemented in their optimal architecture.
Flexibility is added to the system as a separate programmable component, which can
take over control in those cycles where functionality needs to change. This novel
protocol allows for fine grain control, which is needed since it is not known in advance
which execution cycles of the hardware realization will have to be substituted by new
functionality on the flexible platform. The fine grain control is realized with a control
flow inspection mechanism and an interrupt mechanism. The customized memory
architecture is shared with the flexible component, solving the data transfer and storage
bottleneck for multimedia applications.
All processors described above have static datapaths. The hardware is incapable of
adapting to input tasks at run-time. Hardware is usually designed with sufficient
resources for all possible types of applications expected to run on it. However, tasks
that require minimal resources and tasks that require maximum resources pass
through the same datapath, which reduces the overall utilization of hardware. This
issue is addressed in this thesis. Chapter 3 explains the superscalar pipelining
concepts in terms of which the problem can be defined clearly.
CHAPTER 3
SUPERSCALAR PIPELINED ARCHITECTURE 
The scheme proposed in this thesis work is evaluated on the DLX architecture
designed by Hennessy and Patterson as a representative architecture of most
commercial processors. This chapter explains the architectural design of the DLX
machine. It also presents the design concepts of a pipelined, superscalar architecture.
3.1 DLX Architecture
The DLX is a simple load-store architecture described in [1]. It is developed purely
for academic interests, with an architecture similar to most commercial computers like
AMD 29K, DECstation 3100, HP 850, IBM 801, Intel i860, etc. [1]. The DLX architecture
consists of thirty-two 32-bit general purpose registers called R0, R1, R2, ... R31, where
R0 always holds the value zero. The word size for the DLX is 32 bits. Integer data and
single precision floating point data are thus 32-bit, while double precision floating
point data is 64-bit. The DLX uses the immediate and displacement addressing modes
for data, with the immediate or displacement value stored in a 16-bit field. Main memory
is accessed using a 32-bit address and is byte addressable. The operations supported by
the DLX are classified into four major types: ALU, branch, load-store and floating point
operations. Table 1 lists the opcodes of these operations. The control instructions are
jumps and branches, where branches are conditional and their condition needs to be
evaluated before the branch is resolved. The floating point unit of DLX handles all
floating point operations as well as the integer operations of multiply and divide.
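As an illustration of the register and addressing conventions just described, the following minimal Python sketch models a register file in which R0 always reads as zero, together with displacement addressing using a sign-extended 16-bit field. The names `RegisterFile`, `sign_extend16` and `effective_address` are hypothetical and are not taken from the thesis or from [1].

```python
MASK32 = 0xFFFFFFFF  # DLX word size is 32 bits


def sign_extend16(field: int) -> int:
    """Interpret a 16-bit field as a signed two's-complement integer."""
    field &= 0xFFFF
    return field - 0x10000 if field & 0x8000 else field


class RegisterFile:
    """Thirty-two 32-bit general purpose registers; R0 always holds zero."""

    def __init__(self) -> None:
        self._regs = [0] * 32

    def read(self, index: int) -> int:
        return 0 if index == 0 else self._regs[index]

    def write(self, index: int, value: int) -> None:
        if index != 0:  # writes to R0 are silently discarded
            self._regs[index] = value & MASK32


def effective_address(regs: RegisterFile, base: int, displacement16: int) -> int:
    """Displacement addressing: base register plus sign-extended 16-bit field."""
    return (regs.read(base) + sign_extend16(displacement16)) & MASK32
```

For example, with R1 holding 100, a displacement field of 0xFFFC (that is, -4) yields the byte address 96.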
Figure 4 shows the scalar, pipelined implementation of DLX. It consists of five
stages: instruction fetch, instruction decode and register fetch, execute and effective
address calculation, memory access, and write-back.
[Figure 4 block diagram: Instruction Fetch, Instruction Decode, Execution, Memory Access, Write Back]
Figure 4: Pipeline stages in a DLX architecture
1. Instruction Fetch
The fetch stage is responsible for fetching the instruction to be executed. It
fetches a new instruction at every clock cycle unless the pipeline is stalled. The DLX
uses a special register called the Program Counter (PC) to store the address of the
instruction to be fetched. The PC is incremented by 4 to point to the sequential
instruction stored in the next memory word. The fetched instruction is stored in a
special register called the Instruction Register (IR), while a special register called the
Next Program Counter (NPC) stores the address of the next instruction to be fetched.
2. Instruction Decode
This stage decodes the instruction stored in IR and accesses the register file to
read the registers that contain data. Since the DLX is a load-store architecture,
operands are first loaded into registers using the load instruction and then
operations are performed on them. These operands are read into two temporary
registers (A and B). If the immediate addressing mode is used, then the immediate
value is sign-extended before being stored in a register. Instruction decoding and
register file access are done concurrently, which is possible due to the fixed-width
instruction format.
3. Execution and Effective Address Calculation
The instruction is issued to the execution unit, which performs the desired arithmetic,
compare, logical or shifting operation. If it is a load or store instruction, then this
stage performs the effective address calculation to generate the memory address from
which data is to be loaded or at which it is to be stored.
4. Memory Access
For load instructions, data is fetched from the memory address generated in the
previous stage and loaded into the Load Memory Register, while for store instructions,
data is written from the specified register into memory. For branch instructions, the
condition for branching is evaluated in the previous stage, and the PC is replaced or
incremented based on the result produced. ALU instructions are completed in this
stage by writing the result of the ALU operation into the desired register file location.
This is commonly referred to as the MEM stage.
5. Write-back
The register file is updated with the data from the Load Memory Register, and the load
instruction is completed.

Table 1 DLX instructions [1]
Instruction type/opcode | Instruction meaning
Data transfers | Move data between registers and memory, or between the integer and FP or special registers; only memory address mode is 16-bit displacement + contents of a GPR
LB, LBU, SB | Load byte, load byte unsigned, store byte
LH, LHU, SH | Load halfword, load halfword unsigned, store halfword
LW, SW | Load word, store word (to/from integer registers)
LF, LD, SF, SD | Load SP float, load DP float, store SP float, store DP float (SP - single precision, DP - double precision)
MOVI2S, MOVS2I | Move from/to GPR to/from a special register
MOVF, MOVD | Copy one floating-point register or a DP pair to another register or pair
MOVFP2I, MOVI2FP | Move 32 bits from/to FP register to/from integer registers
Arithmetic / Logical | Operations on integer or logical data in GPRs; signed arithmetic traps on overflow
ADD, ADDI, ADDU, ADDUI | Add, add immediate (all immediates are 16 bits); signed and unsigned
SUB, SUBI, SUBU, SUBUI | Subtract, subtract immediate; signed and unsigned
MULT, MULTU, DIV, DIVU | Multiply and divide, signed and unsigned; operands must be floating-point registers; all operations take and yield 32-bit values
AND, ANDI | And, and immediate
OR, ORI, XOR, XORI | Or, or immediate, exclusive or, exclusive or immediate
LHI | Load high immediate - loads upper half of register with immediate
SLL, SRL, SRA, SLLI, SRLI, SRAI | Shifts: both immediate (S__I) and variable form (S__); shifts are shift left logical, right logical, right arithmetic
S__, S__I | Set conditional: "__" may be LT, GT, LE, GE, EQ, NE
Control | Conditional branches and jumps; PC-relative or through register
BEQZ, BNEZ | Branch GPR equal/not equal to zero; 16-bit offset from PC
BFPT, BFPF | Test comparison bit in the FP status register and branch; 16-bit offset from PC
J, JR | Jumps: 26-bit offset from PC (J) or target in register (JR)
JAL, JALR | Jump and link: save PC+4 to R31, target is PC-relative (JAL) or a register (JALR)
TRAP | Transfer to operating system at a vectored address
RFE | Return to user code from an exception; restore user mode
Floating point | Floating-point operations on DP and SP formats
ADDD, ADDF | Add DP, SP numbers
SUBD, SUBF | Subtract DP, SP numbers
MULTD, MULTF | Multiply DP, SP floating point
DIVD, DIVF | Divide DP, SP floating point
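The overlap between the five pipeline stages listed above can be sketched with a small timing model. This is only an idealized illustration that assumes one instruction enters the pipeline per cycle and that no stalls occur; the function names `stage_in_cycle` and `total_cycles` are hypothetical.

```python
# Stage order of the scalar DLX pipeline described in the text.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_in_cycle(instr_index: int, cycle: int):
    """Stage occupied by instruction `instr_index` in `cycle` (both 0-based).

    Assumes an ideal pipeline without stalls: instruction i enters IF in
    cycle i and advances one stage per cycle. Returns None if the
    instruction is not in the pipeline during that cycle.
    """
    position = cycle - instr_index
    if 0 <= position < len(STAGES):
        return STAGES[position]
    return None


def total_cycles(n_instructions: int) -> int:
    """Cycles needed to complete n instructions on the ideal pipeline."""
    if n_instructions == 0:
        return 0
    return n_instructions + len(STAGES) - 1
```

Under these assumptions, ten instructions complete in 14 cycles rather than the 50 that strictly sequential execution of five steps per instruction would need.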
The scalar DLX pipeline can be extended to a superscalar pipelined version using the
superscalar concepts described in the next section. The number of pipeline stages and
their functions remain similar. Section 3.3 describes the VHDL implementation of a
superscalar and pipelined DLX architecture.
3.2 Generic Superscalar Pipeline
A superscalar pipeline is characterized by concurrent instruction processing and
out-of-order execution. A superscalar pipeline parallelizes instruction execution by
duplicating processing elements. Figure 5 shows the block diagram of a generic
superscalar pipeline of width s. The pipeline consists of six main stages: instruction
fetch, instruction decode, dispatch, execute, reorder (completion) and retirement. These
stages perform tasks similar to those performed by the five-stage pipeline described for
the DLX architecture. The fetch, decode and dispatch stages process instructions in
order, while the execute stage processes instructions out of order. The
reorder stage forces the instructions to retire in program order.
[Figure 5 block diagram: Instruction Fetch, Instruction Decode, Dispatch, Execution, Completion, Retirement]
Figure 5. Generic Superscalar Pipeline Stages
3.2.1 Instruction Fetch
The objective of the instruction fetch stage is to fetch s instructions in every clock
cycle. The IF stage employs mechanisms to maximize the input bandwidth of the
pipeline to achieve this goal. A high input bandwidth is essential to achieve high
instructions-per-cycle throughput. The fetch bandwidth is reduced by:
1. misaligned instructions
2. control instructions that alter the sequential flow of a program
Misalignment has been addressed dynamically in the IBM RS/6000 architecture by the
use of hardware logic that issues run-time control signals to the instruction cache to
fetch misaligned instructions in a single memory access. Other techniques to reduce
misalignment include static mechanisms employed at compile time. Control
instructions like jump and branch instructions change the program flow. These are
dealt with using branch prediction schemes that try to predict accurately the next
instruction to be fetched and to keep the instruction fetch buffer filled with instructions.
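The effect of misalignment on fetch bandwidth can be illustrated with a short sketch. It assumes, purely for illustration, that a single fetch cannot cross an instruction cache line boundary and that positions are counted in whole instructions; the function name and parameters are hypothetical and do not describe any particular processor.

```python
def instructions_fetched(start_index: int, fetch_width: int, line_size: int) -> int:
    """Instructions delivered by one fetch beginning at `start_index`.

    Assumption: the fetch group may not cross a cache line boundary, so a
    group that starts near the end of a line is cut short.
    `line_size` is the number of instructions per cache line.
    """
    offset = start_index % line_size
    return min(fetch_width, line_size - offset)


def average_fetch(fetch_width: int, line_size: int) -> float:
    """Mean group size if every start offset within a line is equally likely."""
    total = sum(instructions_fetched(o, fetch_width, line_size)
                for o in range(line_size))
    return total / line_size
```

With a fetch width of 4 and 8 instructions per line, an aligned fetch delivers 4 instructions, while a fetch starting at the seventh instruction of a line delivers only 2.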
3.2.2 Instruction Decode
The instruction decode stage deals with generating the control signals necessary for
other modules to correctly execute an instruction. This includes separating individual
instructions, establishing the instruction operation and location of operands, and
determining inter-instruction dependencies. For machines with a fixed instruction
length, the task of separating instructions is trivial. The number of addressing modes
and instruction types add to the complexity of the decoder. The decoder identifies
dependencies between instructions and extracts parallelism between them. It employs a
large number of comparators for determining dependencies. The decoder in CISC
machines requires a highly intricate design. If the instruction set consists of
instructions with variable lengths, then it is difficult to decode instructions in
parallel, since the start of each instruction depends on the length of the one before it.
To reduce the time taken for decoding, some commercial processors like
the AMD K5™ [24] use pre-decoders. The pre-decoders decode an instruction partially
and communicate control bits along with the instruction to the instruction decoder.
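The comparator-based dependency check mentioned above can be sketched as follows. The example only detects read-after-write dependencies inside one decode group, and it assumes at most two source registers per instruction and treats R0 as never creating a dependency; the `Instr` type and both function names are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    dest: Optional[int]      # destination register number, or None
    srcs: Tuple[int, ...]    # source register numbers


def raw_dependencies(group: List[Instr]) -> List[Tuple[int, int]]:
    """Pairs (i, j) where instruction j reads a register written by earlier i.

    Each membership test below stands for one hardware comparator that
    compares a source register number with an earlier destination.
    """
    deps = []
    for j, later in enumerate(group):
        for i in range(j):
            dest = group[i].dest
            # R0 always reads as zero, so it is assumed to carry no dependency.
            if dest is not None and dest != 0 and dest in later.srcs:
                deps.append((i, j))
    return deps


def comparator_count(width: int, sources_per_instr: int = 2) -> int:
    """Comparators needed to check every source against every earlier dest."""
    return sources_per_instr * width * (width - 1) // 2
```

Under these assumptions the comparator count grows quadratically with the decode width: 2 comparators for a width of 2, but 12 for a width of 4.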
3.2.3 Dispatch
At the dispatch stage, the following tasks take place:
■ register renaming
■ allocation of reservation units
■ allocation of reorder buffer entries
■ forwarding of instructions to the next stage
There are several types of execution units present in a superscalar pipeline, for
processing different types of instructions. For example, integer operations are handled
by integer units, while floating point operations are handled by floating point units. The
dispatch stage is required to route an instruction to the appropriate execution unit.
Instructions that have been decoded but await one or more operands are placed in
reservation units. Reservation units are multi-entry instruction buffers that are specific
to each execution unit if implemented as distributed reservation units, or a single
global multi-entry buffer if implemented as a centralized reservation unit. They keep
track of instructions that are ready to execute and forward them to the execution unit
once the required execution unit becomes available. The Intel Pentium Pro [25]
uses a centralized reservation unit, while the PowerPC 620 [26] uses a distributed
reservation unit.
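The behaviour of a reservation unit can be sketched with a small model. This is a simplified illustration under stated assumptions: operands are identified by arbitrary tags, a result broadcast makes every matching operand available, and the oldest ready entry is issued first. The class and method names are hypothetical and do not describe the Pentium Pro or PowerPC 620 designs.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Entry:
    op: str
    waiting_on: Set[str]  # tags of operands that are not yet available


class ReservationUnit:
    """Multi-entry buffer holding decoded instructions until they are ready."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: List[Entry] = []

    def dispatch(self, op: str, waiting_on: Set[str]) -> bool:
        """Place an instruction in the buffer; returns False if it is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(Entry(op, set(waiting_on)))
        return True

    def broadcast(self, tag: str) -> None:
        """A result identified by `tag` has been produced by some unit."""
        for entry in self.entries:
            entry.waiting_on.discard(tag)

    def issue(self) -> Optional[str]:
        """Remove and return the oldest entry whose operands are all ready."""
        for index, entry in enumerate(self.entries):
            if not entry.waiting_on:
                return self.entries.pop(index).op
        return None
```

Because a younger instruction with ready operands can be issued ahead of an older one that is still waiting, this buffer is where out-of-order execution begins.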
3.2.4 Execute
The execute stage in a superscalar pipeline consists of one or more functional units
of different types. Functional units are specialized and numerous in order to be able to
execute instructions in parallel and in an efficient manner. The functional units that
are generally present in most superscalar implementations are load-store units, integer
units, branch units and floating point units. The number of these functional units is
decided by the mix of instruction types expected to run on the machine. As the number
of functional units is increased, there is an increase in hardware complexity due to the
increase in forwarding paths and interconnections required for routing operands to the
appropriate execution unit.
3.2.5 Complete
In this stage, results of executed instructions are written into the desired registers. It is
also responsible for completing instructions in sequential order. This is necessary to
maintain the sequential nature of program execution. For this purpose, it uses a buffer
called the Reorder Buffer. The reorder buffer maintains a circular queue which enables
in-order retirement of instructions.
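The in-order retirement enforced by the reorder buffer can be sketched as follows. The text describes a circular queue; this sketch substitutes Python's `collections.deque` for the circular buffer, which gives the same first-in, first-out behaviour. The class and method names are hypothetical.

```python
from collections import deque
from typing import List


class ReorderBuffer:
    """Entries are allocated in program order and retired in the same order."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.queue = deque()  # each entry is [tag, finished_flag]

    def allocate(self, tag: str) -> bool:
        """Reserve an entry at dispatch time; returns False if the buffer is full."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append([tag, False])
        return True

    def mark_done(self, tag: str) -> None:
        """Record that the instruction `tag` has finished executing."""
        for entry in self.queue:
            if entry[0] == tag:
                entry[1] = True
                return

    def retire(self) -> List[str]:
        """Retire finished instructions from the head, stopping at the first
        instruction that has not finished yet."""
        retired = []
        while self.queue and self.queue[0][1]:
            retired.append(self.queue.popleft()[0])
        return retired
```

If instruction b finishes before the older instruction a, nothing is retired until a also finishes, at which point both leave the buffer in program order.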
3.2.6 Retirement
Memory updates generally incur more latency. An instruction that involves writing
to memory is not complete until the memory operation is performed. The retirement
unit performs this action and completes such instructions.
3.3 Superscalar, Pipelined DLX implementation in VHDL
The VHDL implementation of the superscalar version of DLX is a two-wide, five-stage
pipelined 32-bit architecture. It is capable of executing integer arithmetic and logical
operations, compare, shift, jump and branch instructions. It does not contain a floating
point unit. The architecture uses an Instruction Cache to store instructions loaded from
memory. Figure 6 shows the pipeline stages in this implementation.
[Figure 6 block diagram: Instruction Cache, Instruction Fetch (Instr-A, Instr-B), Decode and Dispatch, Integer unit, Multiply/Divide Unit, Load/Store Unit, Completion, Retirement, Register File]
Figure 6. Superscalar, Pipelined DLX Implementation in VHDL
Each stage can process two instructions simultaneously. Figure 7 shows the block
diagram of the integer unit. It is implemented as a 32-bit functional unit.
[Figure 7 block diagram: Reservation Unit, Op2, Adder/Subtracter, Shifter, Logical Operations, Comparator, Select operation Multiplexer, Result]
Figure 7. Block diagram of integer unit in VHDL implementation of superscalar DLX
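A behavioural view of such an integer unit is sketched below: the sub-blocks named in Figure 7 each compute a candidate result, and the operation select chooses one of them, as the multiplexer does. This is not the VHDL code of the thesis. The operation names, and the choice to use only the low 5 bits of the second operand as the shift amount, are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x: int) -> int:
    """Reinterpret a 32-bit pattern as a signed two's-complement value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def integer_unit(op: str, a: int, b: int) -> int:
    """Return the 32-bit result of `op` applied to operands a and b."""
    a &= MASK32
    b &= MASK32
    shift = b & 31  # assumption: shift amount is taken from the low 5 bits
    if op == "add":        # adder/subtracter
        result = a + b
    elif op == "sub":
        result = a - b
    elif op == "and":      # logical operations
        result = a & b
    elif op == "or":
        result = a | b
    elif op == "xor":
        result = a ^ b
    elif op == "sll":      # shifter
        result = a << shift
    elif op == "srl":
        result = a >> shift
    elif op == "sra":
        result = to_signed(a) >> shift
    elif op == "slt":      # comparator (signed less-than)
        result = int(to_signed(a) < to_signed(b))
    elif op == "seq":      # comparator (equality)
        result = int(a == b)
    else:
        raise ValueError(f"unsupported operation: {op}")
    return result & MASK32  # results wrap to the 32-bit word size
```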
The VHDL program takes a text file containing machine codes as input. It can be
simulated using Active-HDL 7.1. Benchmark programs are usually available as
assembly-level programs. Such benchmark programs for DLX cannot be directly used
as input to the VHDL program. Figure 8 shows the data flow diagram for using the
VHDL DLX processor emulator code. Benchmark programs with extension .asm are first
converted to a text file with extension .out using a freely available DLX assembler
program called dlxasm [27]. The dlxasm assembler converts DLX instructions into their
respective DLX machine codes. Each machine code is indexed by the 32-bit memory
address at which the instruction is expected to be stored in a true hardware system.
The format of the .asm and converted .out files is given in the Appendix. The .out file
is used as input to the simulator engine that contains the VHDL code.
[Figure 8 data flow diagram; labels: .asm file, dlxasm, .out file, testbench, Simulation Engine, waveform, Waveform Viewer]
Figure 8. Dataflow diagram during simulation of VHDL implementation of proposed concepts
The simulation engine produces waveforms for signals that propagate in the
processor. These are in the form of a Value Change Dump (.VCD) file and can be easily
viewed using a waveform viewer.
Table 2 [1] lists the average MIPS dynamic instruction mix present in five
SPECint2000 programs: gap, gcc, gzip, mcf, perl, and that present in five SPECfp2000
programs: applu, art, equake, lucas, swim.
Table 2 Average of MIPS dynamic instruction mix in SPECint2000 and SPECfp2000 benchmark suite
Instruction types | Average % of integer operations in integer benchmarks | Average % of integer operations in floating point benchmarks
Load-store | 38% | 22%
Add, sub, compare, shift, and, or, xor | 45% | 31%
Branch, conditional move, jump, call, return | 16% | 4%
The largest share of the instruction mix consists of the integer operations add, subtract,
compare, shift, and, or and exclusive-or. From these statistics, we can conclude that if
there exists only one integer unit, then more often than not, a centralized reservation
unit will be filled with waiting integer ALU instructions.
To cater to the high percentage of ALU instructions, it is necessary to include more
than one ALU unit. The number of ALU instructions can vary widely from one
benchmark suite to another. Therefore, unchecked addition of more ALU units can
result in idle ALU units or idle other functional units in the execution stage. Hence, a
flexible scheme is proposed in this thesis that takes into account the observation that the
value of operands of ALU operations is not always as large as that accommodated by the
word length of the machine. This scheme and all concepts associated with it are explained
in the following chapter.
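The observation that operand values are often narrower than the machine word can be made concrete with a small helper that reports how many 8-bit slices of a 32-bit word a value actually needs. The criterion used here, that the value must fit as a signed two's-complement number, is an assumption for illustration; the exact criterion used in the thesis is not stated in this chapter. The function name is hypothetical.

```python
def slices_needed(value: int, slice_bits: int = 8, word_bits: int = 32) -> int:
    """Smallest number of slices whose combined width can hold `value`.

    The value is treated as a signed two's-complement integer, so n slices
    cover the range -2**(n*slice_bits - 1) .. 2**(n*slice_bits - 1) - 1.
    """
    for n in range(1, word_bits // slice_bits + 1):
        bits = n * slice_bits
        if -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
            return n
    raise ValueError("value does not fit in the machine word")
```

A loop counter such as 5 needs a single 8-bit slice, while 70000 needs three slices (24 bits), so neither occupies the full 32-bit datapath.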
CHAPTER 4
CONCEPTS AND IMPLEMENTATION 
A pipelined processor is designed to deliver maximum throughput. The design
decisions include determining the width of the pipeline (superscalar width), the number of
functional resources, and the nature and depth of the pipeline. This thesis
concentrates on the execution stage of the pipeline. The execution stage consists of one
or more execution units of different types and reservation stations in a centralized or
distributed architecture. While designing the execution stage of a processor, it is
extremely important to determine the optimum number of execution units of each type.
This decision is typically based on the applications served by the processor and the type
of tasks that are expected to run on it. Execution units are provided to service all types of
instructions present in the instruction set that need a computation unit. A generic
instruction set consists of four types of instructions: ALU, Branch, Load and Store. The
number of units allotted for each type of instruction has to be determined on the basis
of example programs that will run on the machine and the performance expected. The
number thus decided upon affects the space requirements, power usage and additional
logic necessary for smooth functioning of these units in parallel.
There are several reasons for the design proposed in this thesis. The usage of an
integer ALU unit was studied by running several benchmarks on a VHDL
implementation of the DLX superscalar processor. Table 3 shows the results obtained.
Table 3 Usage of ALU units in benchmarks
Benchmark Program | # of instns | Time of simulation | # of ALU instns | # of ALU instns with 8-bit | # of ALU instns with 16-bit | # of ALU instns with 24-bit | # of ALU instns with 32-bit
ALUinstructionsPart1 | 22 | 93.5 | 22 | 10 | 4 | 8 | 0
ALUinstructionsPart2 | 33 | 137.5 | 33 | 14 | 10 | 4 | 5
BranchJump | 85.5 | 85.5 | 10 | 0 | 0 | 3 | 7
BubbleSort | 6477 | 12191.5 | 2741 | 658 | 1 | 708 | 1374
Dbc | 18 | 61.5 | 9 | 0 | 0 | 5 | 4
LoadStore | 30 | 119.5 | 5 | 1 | 1 | 2 | 1
MDUinstructions | 39 | 198.5 | 27 | 8 | 9 | 7 | 4
PrimeNumber | 1321 | 6595.5 | 718 | 1 | 22 | 360 | 335
On analysis, it can be seen that, taken over all the benchmarks, less than 50% of the ALU
instructions use the entire data width of the ALU. Thus the usage of the ALU is less than
100%. In any design, if additional ALU units are added to cater to a larger number of input
ALU instructions, then by projecting a similar usage statistic to these additional units,
the overall ALU usage will only decrease.
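The usage figures can be checked directly from the ALU columns of Table 3. The sketch below copies those columns as printed and computes the fraction of ALU instructions that need the full 32-bit width. The second function, `datapath_occupancy`, is one possible usage measure introduced here as an assumption (average operand width divided by the word width); it is not defined in the thesis.

```python
# ALU columns of Table 3, as printed:
# (ALU instns, 8-bit, 16-bit, 24-bit, 32-bit)
TABLE3 = {
    "ALUinstructionsPart1": (22, 10, 4, 8, 0),
    "ALUinstructionsPart2": (33, 14, 10, 4, 5),
    "BranchJump": (10, 0, 0, 3, 7),
    "BubbleSort": (2741, 658, 1, 708, 1374),
    "Dbc": (9, 0, 0, 5, 4),
    "LoadStore": (5, 1, 1, 2, 1),
    "MDUinstructions": (27, 8, 9, 7, 4),
    "PrimeNumber": (718, 1, 22, 360, 335),
}


def full_width_fraction(rows) -> float:
    """Fraction of ALU instructions that use the full 32-bit data width."""
    rows = list(rows)
    total = sum(r[0] for r in rows)
    full = sum(r[4] for r in rows)
    return full / total


def datapath_occupancy(rows, word_bits: int = 32) -> float:
    """Assumed measure: mean operand width as a fraction of the word width."""
    rows = list(rows)
    widths = (8, 16, 24, 32)
    used = sum(w * n for r in rows for w, n in zip(widths, r[1:]))
    count = sum(sum(r[1:]) for r in rows)
    return used / (word_bits * count)
```

Summed over all eight programs, 1730 of 3565 ALU instructions (about 48.5%) use the full width, although individual programs such as BranchJump (7 of 10) lie above 50%.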
The u se  of reservation  sta tions encourages parallelism  am ong ready in stru ctio n s , 
w aiting for resources. In an  ALU intensive task , the n u m b er of su ch  w aiting 
in stru ctio n s  justifies the  u se  of high n u m b er of resources, while in  non-ALU intensive 
ta sk s , the  usage  of ALU u n its  is m inim al.
Also, in the older machines, floating point operations were performed using integer
units. As the use of floating point operands increased, dedicated floating point units
were introduced in the execution stage. In these machines, while executing a
floating-point intensive task, the integer units are idle for most of the execution time. If the
integer units had the capability to perform floating point operations on ready and
waiting FP instructions, then the throughput would increase.
It is clear that the usage of execution units would increase if there was a technique
to cater to different types of incoming instruction traffic. This thesis adds run-time
flexibility to hardware modules for the purpose of accommodating as many instructions
as possible in the execution unit. The exact extra hardware and logic required to do this
is designed, implemented and evaluated in Section 4.3, while the general concepts
associated with the addition of flexibility are described below. These can be applied in
any form to any application.
4.1 Block Slicing
‘Block slicing’ refers to the process of splitting a block into multiple modules. The
concept of block slicing can be explained as follows:
A functional unit which is capable of performing an operation T on two N-bit
operands usually consists of N interconnected copies of units that can perform the
operation T on two 1-bit operands. Let a logic circuit capable of performing a certain
operation on 1-bit operands be called a unit. When N units are interconnected so that
they can concurrently perform the operation on N-bit operands, they form an N-bit
module. In all implementations, N is known or is pre-set. Thus, the interconnection
network F between the units that form the modules is static in nature. When operands of
varying lengths are encountered, the value of N is required to be dynamic. In order to
allow N to be a dynamic value determined at run-time, it is necessary to make the
interconnection network flexible.
The network can be built to be completely flexible, but it is impractical to reprogram
it before execution of every instruction. Instead, a degree of flexibility is allotted to it.
For this, m units are connected together statically to form m-bit functional units. Let
each m-bit unit be referred to as a slice. Each slice is capable of operating on m-bit
operands. In a contemporary processor, if N-bit functional modules are present, then
there will be N/m slices in a sliced architecture. For example, a processor containing
one 64-bit ALU will now have four 16-bit slices (N=64 and m=16).
The interconnection network F' ⊂ F between slices is now completely flexible, so
that each slice can operate independently, or connect itself to more slices and operate
concurrently with them. When two m-bit slices operate independently, they are capable
of executing two instructions simultaneously, provided the operands are m-bit. When
two slices connect together, they form a 2m-bit functional module and can operate on
one instruction with 2m-bit operands. Since there are N/m slices, when all slices are
connected to each other, they can operate on N-bit operands as before.
When a module is thus separated into smaller parts, it is said to be ‘sliced’. If a
module is sliced into enough m-bit slices in the execution stage of a processor, all ready
instructions requiring m-bit operands can be executed in parallel.
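
The relation between N, m and the attainable parallelism can be illustrated with a small calculation. This is a sketch only; the function names slice_count and max_parallel are hypothetical, and it assumes that an instruction with wider operands occupies a whole number of slices.

```python
def slice_count(n_bits, m_bits):
    """Number of m-bit slices obtained from one N-bit module (N/m)."""
    if n_bits % m_bits != 0:
        raise ValueError("N must be a multiple of m")
    return n_bits // m_bits


def max_parallel(n_bits, m_bits, operand_bits):
    """Upper bound on instructions of a given operand width executing together."""
    slices_each = -(-operand_bits // m_bits)  # ceiling division
    return slice_count(n_bits, m_bits) // slices_each


if __name__ == "__main__":
    # The example from the text: one 64-bit ALU sliced into 16-bit slices.
    print(slice_count(64, 16))       # number of slices
    print(max_parallel(64, 16, 16))  # m-bit instructions in parallel
    print(max_parallel(64, 16, 32))  # 2m-bit instructions in parallel
```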
Based on the ready instructions encountered, slices are first allocated to each
instruction. Once slice-allocation is decided, there are two functions associated with the
process of allocation before the instructions are ready to be executed:
1. Directing the operands into the correct operand register slices, and
2. Directing the result correctly into an N-bit output register.
These functions can be performed by using decoders at the input and output of the
execution unit. A truth table for the decoder can be easily developed and implemented
as the internal circuitry for the decoder. Different execution units need different
decoding functions as can be seen from the architecture explained in the next section.
4.2 Sliced ALU Implementation
When a sliced ALU is used, the stages in which an instruction undergoes processing
are shown in Figure 9. The Resource Mapping is done by a unit called the ‘Resource
Mapper’. It is the only additional stage that gets added to the pipeline, but its latency is
equivalent to a few logic gates, and hence it need not be pipelined as a separate unit,
rather as a part of the dispatch pipeline stage. It is explained in detail in the next
section.
4.2.1 Resource Mapper
This unit determines the number of slices required by an incoming instruction and
allocates slices for all incoming instructions. For determining the number of slices
required by an instruction, the resource mapper performs a function called
‘zero-checking’. This function determines the length of significant bits in both operands and
returns the maximum of these two lengths as the number of slices required by the
instruction. This can be achieved simply by using AND gates. The zero-checking
function is slightly different for the shift operation, for which not only the number of
significant bits of first operand are required, but also the value of the second operand.
Using these values and a simple logic circuit, the number of units required by a shift
instruction can be determined.
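
A software model of zero-checking may help make the rule concrete. The sketch below is not the thesis hardware: the function names, the default 8-bit slice width and 32-bit operand width are assumptions, operands are treated as unsigned bit patterns (so a negative value occupies the full width), and the shift variant assumes a left shift whose result grows by the shift amount.

```python
def significant_bits(value, width=32):
    """Operand length once leading zeros are dropped (unsigned view)."""
    return (value & ((1 << width) - 1)).bit_length()


def slices_required(op1, op2, slice_bits=8, width=32):
    """Slices needed by a two-operand ALU instruction (at least one)."""
    bits = max(significant_bits(op1, width), significant_bits(op2, width))
    return max(1, -(-bits // slice_bits))  # ceiling division


def slices_required_shift_left(op1, shamt, slice_bits=8, width=32):
    """Assumed rule for a left shift: the result needs len(op1) + shamt bits."""
    bits = min(width, significant_bits(op1, width) + shamt)
    return max(1, -(-bits // slice_bits))


if __name__ == "__main__":
    print(slices_required(0x12, 0x7F))
    print(slices_required(0x1234, 0x01))
    print(slices_required_shift_left(0xFF, 4))
```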
Figure 9. Processing stages for using a sliced ALU implementation
[Blocks shown: Instruction Fetch, Instruction Decode, Dispatch, Resource Mapping, Execution, Reorder, Retirement]

With each reservation unit is associated a register called the Resource Allocation
Vector (RAV). The Resource Allocation Vector keeps track of slices allotted to the
instruction stored in a reservation unit. In addition, the Resource Mapper uses a global
register called the Resource Vector (RV). If there are m slices in the execution unit, then
the RAV and RV are m-bit. Each bit in the RAV and RV indicates the status of a slice of the
execution unit as allocated/not-allocated. When a slice is allotted to an instruction, the
bit in the respective location of the slice is set to 1. When an instruction finishes using
the slice, the bit is reset to 0. In absence of slicing, a similar process is followed by a
reservation unit to issue instructions to an execution unit. The reservation unit checks
the busy/available bit of a functional unit and issues an instruction to it if it is free.
The Resource Mapper also issues an instruction to one or more slices of functional
units and sets one or more bits at a time in the RAV of the instruction and global RV
respectively.
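
The bookkeeping described above can be modeled with plain bit operations. The class below is a minimal sketch under stated assumptions: the name ResourceVector and its methods are hypothetical, four slices are assumed, and an RAV is represented as an integer bit mask with bit i standing for slice i.

```python
class ResourceVector:
    """Global RV: bit i is 1 while slice i is allocated to some instruction."""

    def __init__(self, num_slices=4):
        self.num_slices = num_slices
        self.rv = 0

    def allocate(self, rav):
        """Set the RV bits named by an instruction's RAV; refuse if any is busy."""
        if rav >> self.num_slices:
            raise ValueError("RAV names a slice that does not exist")
        if self.rv & rav:
            return False  # at least one requested slice is already allocated
        self.rv |= rav
        return True

    def release(self, rav):
        """Reset the RV bits once the instruction has finished with its slices."""
        self.rv &= ~rav


if __name__ == "__main__":
    rv = ResourceVector()
    rv.allocate(0b0001)  # instruction A
    rv.allocate(0b0110)  # instruction B
    print(format(rv.rv, "04b"))
```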
Figure 10. Block diagram of a sliced ALU
[Blocks shown: Adder/Subtracter, Comparator with compare logic, Logical Operations and Shifter, each fed through input decoders with enable (En) signals for operands op1 and op2 of instructions A and B, and followed by output decoders producing Result-A and Result-B]
If the execution units are all known to finish the execution of an instruction in one
clock cycle, then a global Resource Vector can be assumed to be an all-zero number at
the beginning of every clock cycle, and is redundant. In this case, allocation is done by
examining all ready instructions waiting for a resource and determining the number of
slices required by each. In the situation where the ready instructions need more slices
than available, the instructions can be prioritized based on instruction count and other
instructions can be stalled. De-allocation is not necessary here. The Resource Vector
will only be needed if some instructions take longer than a clock cycle to finish. Though
unused in this thesis work, the use of Resource Vector has been proposed in view of
future work, one instance of which is when integer slices are rearranged into a floating
point pipeline, with a latency of more than one clock cycle.
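
For the single-cycle case, allocation reduces to a per-cycle assignment with no de-allocation step. The following sketch shows one possible policy and is not taken from the thesis implementation: the function name allocate_cycle is hypothetical, "prioritized based on instruction count" is interpreted as serving the lowest instruction count first, and slices are handed out contiguously from slice 0 upward.

```python
def allocate_cycle(requests, num_slices=4):
    """Assign slices for one clock cycle.

    requests: list of (instruction_count, slices_needed) for ready instructions.
    Returns (ravs, stalled): a dict mapping instruction_count to its RAV bit
    mask, and the list of instruction counts that must wait for a later cycle.
    """
    next_free = 0
    ravs, stalled = {}, []
    for count, need in sorted(requests):  # lowest instruction count first
        if next_free + need <= num_slices:
            ravs[count] = ((1 << need) - 1) << next_free
            next_free += need
        else:
            stalled.append(count)
    return ravs, stalled


if __name__ == "__main__":
    ravs, stalled = allocate_cycle([(7, 2), (5, 1), (9, 2)])
    print({k: format(v, "04b") for k, v in ravs.items()}, stalled)
```

Under this policy a later, narrower instruction may still be issued after an earlier one stalls; a stricter in-order policy would stop at the first stall.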
Figure 11. Steps of operation of a sliced ALU
[Flow chart: ALU Function A, ALU Function B and Resource Allocation Vectors A, B -> Generate Enable Signals; Operands A1, A2 and Operands B1, B2 -> Load operands according to RAVs -> Compute Result -> Append Zeros -> Forward output]
Figure 10 shows the block diagram of a sliced ALU, while the flow chart in Figure 11
shows the steps in which a sliced ALU functions. The Enable signals in Figure 10 are
fed to D-flip-flops so that only the appropriate part of the ALU functions, while the other
parts retain their values. This leads to lower power consumption.
4.3 Architecture of integer execution units
The architecture of the sliced integer units that are used to execute different types of
integer instructions is proposed below. The integer unit comprises an
adder/subtracter unit, a shifter, a logical unit and a comparison unit.
4.3.1 Adder/Subtracter Unit
Figure 12 shows the design of an adder/subtracter module, built using two slices of
4-bit adder/subtracter. The inputs required for the 4-bit adder/subtracter are two 4-bit
operands and a 1-bit operation add/sub ('0' for addition and '1' for subtraction). The
adder/subtracter module is designed by interconnecting signals between the two slices
and using multiplexers to enable it to operate at variable data length. It is capable of
performing an addition and/or subtraction operation on the set of operands {X3...X0}
and {X7...X4}. Once the operands are loaded into these registers, control signals
add/sub1 and add/sub2 are given to the module. The control input sel indicates
whether the two slices are to perform independently or concurrently. Multiplexer MUX1
determines the propagation of add/sub signal to the second slice, while Multiplexer
MUX2 controls the cascading of carry out signal from the unit U3 to unit U4.
Multiplexer-3 generates the overflow exception bits v1 and v2. After the output is
produced, it is sign-extended in order to be passed on to the result register and
subsequently stored.
When only one slice (say, slice-0) is to be used, the signals {S4..S7}, v2 and Cout7
are ignored, and vice-versa. When both slices are used for one operation, the
appropriate signals are routed to the output.
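
The behaviour of the two-slice module can be approximated in software. The model below is a sketch, not the gate-level design of Figure 12: function and parameter names are hypothetical, subtraction is assumed to be two's complement (inverted operand plus a carry-in of 1), and the overflow flags are computed per slice rather than through a separate multiplexer.

```python
def add_sub_4bit(x, y, sub, carry_in):
    """One 4-bit slice. Returns (sum, carry_out, signed_overflow)."""
    x &= 0xF
    y_eff = (~y & 0xF) if sub else (y & 0xF)  # invert second operand to subtract
    total = x + y_eff + carry_in
    s = total & 0xF
    carry_out = total >> 4
    overflow = ((x ^ s) & (y_eff ^ s) & 0x8) >> 3
    return s, carry_out, overflow


def two_slice_add_sub(x, y, sel, sub1, sub2=0):
    """Two 4-bit slices over 8-bit registers x and y.

    sel = 0: the slices work independently (sub1 for slice 0, sub2 for slice 1).
    sel = 1: the slices are chained into one 8-bit operation controlled by sub1.
    Returns (8-bit result, (cout_low, cout_high), (v1, v2)).
    """
    s0, c0, v1 = add_sub_4bit(x, y, sub1, sub1)
    sub_hi = sub1 if sel else sub2  # role played by MUX1 in the text
    cin_hi = c0 if sel else sub2    # role played by MUX2 in the text
    s1, c1, v2 = add_sub_4bit(x >> 4, y >> 4, sub_hi, cin_hi)
    return (s1 << 4) | s0, (c0, c1), (v1, v2)


if __name__ == "__main__":
    print(hex(two_slice_add_sub(0x3C, 0x2F, sel=1, sub1=0)[0]))  # one 8-bit add
    print(hex(two_slice_add_sub(0x3C, 0x2F, sel=0, sub1=0, sub2=1)[0]))
```

In chained mode only v2 and the high carry are meaningful; in independent mode each slice's own flag applies, which mirrors the rule that unused signals are ignored.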
Figure 12. Two interconnected 4-bit adder/subtracter units forming one 2-slice
adder/subtracter
[Eight 1-bit units, each with inputs X, Y, Cin and outputs Cout and S, producing S0 to S7, with overflow outputs for the two slices]
This design can be extended to include numerous slices of the adder/subtracter
unit. Figure 13 shows the interconnections of four adder/subtracter slices, each
capable of operating on two 8-bit operands, resulting in a 32-bit sliced ALU. The block
diagram of this flexible adder/subtracter unit is shown in Figure 14.
Figure 13. Architecture of flexible adder/subtracter unit
[SLICE-3 to SLICE-0, each taking one byte of op1 and op2 (for example Op1[15:8], Op2[15:8] for SLICE-1 and Op1[7:0], Op2[7:0] for SLICE-0) and its own add/sub control, with carry-in, carry-out and sum outputs S3 to S0]
As explained before, there are two functions associated with slice-allocation:
1. Directing the operands into the correct operand register slices, and
2. Directing the result correctly into an N-bit output register.
Figure 14. Block diagram of a flexible adder/subtracter unit
[Flexible Adder/Subtracter block with inputs op-1, op-2 and four add/sub controls, and per-slice carry-out (Cout) and overflow (V) outputs]
The input operands are initially present in N-bit operand registers. Let an ALU
instruction with two input operand registers containing 8-bit values be ready for
execution and be allotted Slice-1 in Figure X. The operands have to be loaded at
locations [15:8] of registers op1 and op2. Similarly, the 8-bit result generated by Slice-1
has to be directed to locations [7:0] of output register.
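
The operand-steering step in this example amounts to shifting the operand into the lanes named by the RAV. The sketch below models that step only; the function name load_operand is hypothetical, 8-bit slices are assumed, and the RAV is assumed to name a contiguous group of slices.

```python
def load_operand(value, rav, slice_bits=8):
    """Place an operand into the register lanes of the allocated slices."""
    if rav == 0:
        raise ValueError("no slice allocated")
    lowest = (rav & -rav).bit_length() - 1  # index of lowest allocated slice
    used_bits = bin(rav).count("1") * slice_bits
    value &= (1 << used_bits) - 1           # keep only the bits that fit
    return value << (lowest * slice_bits)


if __name__ == "__main__":
    # An 8-bit operand allotted Slice-1 lands in bit positions [15:8].
    print(hex(load_operand(0xAB, 0b0010)))
```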
Table 4 Truth Table for decoder that directs the output of four slices into result register

RAV-3  RAV-2  RAV-1  RAV-0    Res-3   Res-2   Res-1   Res-0
0      0      0      0        0       0       0       0
0      0      0      1        MSB-0   MSB-0   MSB-0   S0
0      0      1      0        MSB-1   MSB-1   MSB-1   S1
0      0      1      1        MSB-1   MSB-1   S1      S0
0      1      0      0        MSB-2   MSB-2   MSB-2   S2
0      1      0      1        X       X       X       X
0      1      1      0        MSB-2   MSB-2   S2      S1
0      1      1      1        MSB-2   S2      S1      S0
1      0      0      0        MSB-3   MSB-3   MSB-3   S3
1      0      0      1        X       X       X       X
1      0      1      0        X       X       X       X
1      0      1      1        X       X       X       X
1      1      0      0        MSB-3   MSB-3   S3      S2
1      1      0      1        X       X       X       X
1      1      1      0        MSB-3   S3      S2      S1
1      1      1      1        S3      S2      S1      S0
The direction of input to appropriate input registers and of the output to a result
register is done by the use of decoders. The truth table for a decoder that performs this
function is shown in Table 4. The RAV is the individual Resource Allocation Vector set
for an instruction. There are two decoders, one for each instruction, which take the
RAVs of the two instructions as input and output the respective sign-extended results.
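
The truth table for the result decoder can be expressed compactly in software. The function below is an illustrative model, not the decoder circuit: the name decode_result is hypothetical, four 8-bit slices are assumed, and the don't-care rows (non-contiguous allocations) are mapped to None.

```python
def decode_result(rav, slice_out, slice_bits=8, num_slices=4):
    """Steer slice outputs into the result register, following Table 4.

    rav: allocation bit mask of one instruction (bit i stands for slice i).
    slice_out: list of per-slice outputs, slice_out[i] coming from slice i.
    Returns the sign-extended result, 0 for an empty RAV, None for 'X' rows.
    """
    if rav == 0:
        return 0
    lowest = (rav & -rav).bit_length() - 1
    count = bin(rav).count("1")
    if rav != ((1 << count) - 1) << lowest:
        return None  # non-contiguous allocation: don't-care row
    lane_mask = (1 << slice_bits) - 1
    value = 0
    for k in range(count):
        value |= (slice_out[lowest + k] & lane_mask) << (k * slice_bits)
    used_bits = count * slice_bits
    total_bits = num_slices * slice_bits
    if (value >> (used_bits - 1)) & 1:  # replicate the MSB of the top slice
        value |= ((1 << total_bits) - 1) ^ ((1 << used_bits) - 1)
    return value


if __name__ == "__main__":
    print(hex(decode_result(0b0010, [0x00, 0x85, 0x00, 0x00])))
    print(hex(decode_result(0b0110, [0x00, 0x34, 0x12, 0x00])))
```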
The adder/subtracter units along with input and output decoders constitute the
complete flexible adder/subtracter. Area analysis for this module is made in Section
4.4.
4.3.2 Compare Unit
The compare operation is required to be performed on both signed and unsigned
operands, and requires a slightly different treatment for each. Figure 15 shows the
block diagram of a comparator that can perform signed comparison or unsigned
comparison based on a 1-bit control signal (0 for unsigned, 1 for signed).
Figure 15. Block diagram of a comparator
[Comparator block with inputs opA, opB and a Signed/unsigned control]
This comparator can be designed as a minimal-delay circuitry, or it can be designed
with minimal area constraint, depending upon the constraints imposed by the system.
Figure 16 shows the use of such comparison units in a sliced comparator design.
Once sliced comparison is performed, the final result of compare operation is
determined by a separate logic circuitry that takes into account the respective outputs
of each compare slice. Therefore, the control signals for compare operations listed in
Table 1 are made available to this unit. The decoder generates an output for the
compare instruction. The logic equations that serve some of these functions are shown
below.
Aeq <= Aeq0 and Aeq1 and Aeq2 and Aeq3
Aneq <= not Aeq
Agt <= Agt3 or (Aeq3 and Agt2) or (Aeq3 and Aeq2 and Agt1) or (Aeq3 and Aeq2 and
Aeq1 and Agt0)
Alt <= Aeq nor Agt
Alteq <= not Agt
Agteq <= Aeq or Agt
Figure 16. Four 8-bit compare slices for signed or unsigned comparison
[SLICE-3 to SLICE-0, each with byte inputs from opA and opB (for example opA[7:0] for SLICE-0) and a Signed/unsigned control, each producing per-slice comparison outputs]
Resource allocation vector for each instruction is also input to this unit, and
equality is tested based on the allocation. For example, if the RAV for instruction A is
0011, the values returned by Aeq3 and Aeq2 are ‘1’, while the values for Agt3 and Agt2
are ‘0’. Thus, the decoding functions that steer the output of sliced comparators into
correct registers are different for equality and greater-than operation. The final bit
output of the compare unit is determined by the equation for the function enabled by
instruction decoder for that instruction. This is then concatenated with (N-1) leading
zeros and returned as an output of the comparator unit.
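
The combination of per-slice compare outputs, including the treatment of unallocated slices, can be modeled as follows. This is a sketch for the unsigned case only (signed handling is omitted); the function name sliced_compare and the dictionary keys are hypothetical, and four 8-bit slices are assumed.

```python
def sliced_compare(op_a, op_b, rav, slice_bits=8):
    """Combine per-slice equality and greater-than outputs of four slices."""
    lane_mask = (1 << slice_bits) - 1
    eq, gt = [], []
    for i in range(4):
        if (rav >> i) & 1:
            a_i = (op_a >> (i * slice_bits)) & lane_mask
            b_i = (op_b >> (i * slice_bits)) & lane_mask
            eq.append(a_i == b_i)
            gt.append(a_i > b_i)
        else:
            eq.append(True)   # unallocated slice reports 'equal'
            gt.append(False)  # and never 'greater than'
    a_eq = eq[0] and eq[1] and eq[2] and eq[3]
    a_gt = (gt[3] or (eq[3] and gt[2]) or (eq[3] and eq[2] and gt[1])
            or (eq[3] and eq[2] and eq[1] and gt[0]))
    return {
        "eq": a_eq,
        "neq": not a_eq,
        "gt": a_gt,
        "lt": not (a_eq or a_gt),
        "lteq": not a_gt,
        "gteq": a_eq or a_gt,
    }


if __name__ == "__main__":
    print(sliced_compare(0x1234, 0x1200, 0b0011))
```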
4.4 Area analysis
This sliced ALU design requires additional hardware for decoders, multiplexers and
added signals. For the implementation of 2:1 multiplexers used extensively in the
design, transmission gates (pass transistor logic) can be used. These are designed using
an NMOS and a PMOS transistor in a configuration that results in no static power
consumption. Figure 17 shows a multiplexer implemented using pass transistor logic.
Figure 17. Multiplexer implemented using pass-transistor logic
Each such multiplexer adds three NMOS and three PMOS transistors to the hardware. To
estimate the hardware used for decoders that perform direction of input and output
signals into correct register slices, the average cost of decoders was computed in terms
of logic gate equivalents. Table 5 lists the additional hardware used by various units in
a sliced ALU.
On the whole, additional hardware introduced for implementation of slicing is
minimal.
On performing a delay analysis, the maximum delay path of decoders is found to be
equivalent to three gate propagation delays. Thus each decoder adds minimal delay to
the execution datapath.
Table 5 Additional hardware used for slicing of ALU

Unit                                 2:1 MUX   4:1 MUX   Logic Gates
Adder                                4
Adder Result Decoder                                     23
Comparator
Comparator Result Decoder            8                   44
Shifter                              14
Logical Operations
Shifter and Logical Result Decoder                       23
RAV Decoder                                              16
Load Operand Decoder                           16
Total                                26        16        106
4.5 Implementation of DLX Sliced Processor using VHDL
In order to evaluate the block slicing concept in a processor, it was implemented in
a DLX pipeline using VHDL (VHSIC Hardware Description Language).
The first two stages of the DLX pipeline, the fetch and decode stage, are similar to
those described in Section 3.2. The dispatch and execute stages differ from the original
implementation, while the reorder and retirement units stay the same. The DLX
processor has been implemented as a pipelined, out-of-order, superscalar processor of
width two. Thus, there are at most two instructions in each stage of the pipeline at any
given time, except the reorder unit.
Once valid operands are fetched in the dispatch stage and an instruction is ready to
begin execution, the number of units required for the instruction is computed from the
value of the operands. This is done by the simple zero-checking unit described in
Section 4.2.1, which checks the number of leading zeros of operands. It gives the length
of the significant digits of operands and hence the number of units required. For
shifting operation, the number of units is computed by also considering the value of the
second operand.
Once the zero checking is done, the Resource Mapper allocates execution unit slices
to an instruction. In addition, the resource mapper also sets the control signals that
slice an execution unit appropriately. The resource vector is a bit vector that indicates
the slices allocated to an instruction. For example, if instruction A is allocated slice
number 1, then its 4-bit resource vector will be 0001. For instruction B with allocated
slice numbers 2 and 3, the resource vector is 0110. Thus, the global resource vector
during that clock cycle is 0111, indicating that only three slices of the execution units
will operate, and the fourth slice will consume idle power.
Data is loaded into the operand registers at the rising edge of the clock. Due to
block slicing, the resource mapping control signals slice the execution unit and the ALU
gives at most two outputs (ALU Output A and ALU Output B) by simultaneous
execution of two instructions. These results are stored into their respective reorder
buffer entries, and forwarded if necessary for the next clock cycle.
The fetch stage is set so as to fetch the next set of instructions when an instruction
is issued to an execution unit. Thus, when instruction-level parallelism exists in a
program, the fetch stage is also sped up and the total time of execution of a program
decreases. In the absence of any additional instruction-level parallelism, the time of
execution of the program remains the same as that in a non-sliced processor.
Chapter 5 presents the simulation results of benchmark programs on the VHDL
implementations of the DLX machines with non-sliced and sliced ALU units respectively.
CHAPTER 5 
RESULTS
The objective of slicing is to increase the utilization and number of functional units
dynamically. Increasing the number of functional units leads to an increase in parallelism
of execution which contributes to speed-up.
The performance criteria used for evaluating the concept of slicing are speed-up,
throughput, utilization and power. These criteria are widely used for comparison of
different architectures. To evaluate the performance of the block slicing concept with
respect to these factors, a hardware code for the DLX processor was developed using
VHDL and tested with benchmark programs. Benchmark programs were obtained from
various internet resources. These were assembly level programs written
for the DLX machine. The .asm files containing benchmark programs were converted to .out
files using the package dlxasm [27] and then run on the VHDL code of the sliced
processor. Instead of developing the code from scratch, the freely available VHDL
package dlx-vhdl [28] was used as base code and it was suitably modified for the
proposed architecture. Section 4.5 describes the VHDL processor code.
Throughput is given by number of instructions completed per unit time. It can also
be related to the number “Instructions Per Cycle (IPC)”, where the unit of time is a clock
cycle. Considering that a new instruction is fetched every clock cycle, the number of
fetch cycles indicates the input stream to the architecture and the number of
instructions committed per fetch cycle indicates the output stream of the processor. The
throughput is then given as:

$$\mathrm{IPfC} = \frac{\text{Total Number of Instructions Committed}}{\text{Total Number of Fetch Cycles}} \qquad (5.1)$$
The speed-up is computed with respect to the DLX architecture without the
processor modifications for block slicing. Thus, speed-up is given as:

$$\text{Speed-up} = \frac{\text{Time of execution on non-sliced DLX architecture}}{\text{Time of execution on sliced DLX architecture}} \qquad (5.2)$$
Resource utilization at the bit-level is given by the % of resource used during time of
execution. Resource utilization can be given in terms of the ratio of number of times the
resource slices were completely used to the total number of times the resource was
accessed.
Power consumed during execution of two sequential operations is evaluated using
the Xilinx XPower tool that is included with Xilinx ISE. The power-delay product is then
used to compare the non-sliced and the sliced architectures.
5.1 Time of Execution and Speed-Up
The above mentioned criteria were evaluated on ten benchmark programs and are
presented below.
Table 6 Results of evaluation of Time of Execution and Speed-up

Benchmark Program    Time of execution (us)    Speed-up   Gain %
                     non-sliced   sliced
ALUinstructions-1    93.5         47.5         1.968      49.198
ALUinstructions-2    137.5        95.5         1.440      30.545
DLX                  61.5         58.5         1.051      4.878
LoadStore            119.5        119.5        1.000      0.000
PrimeNumber          6595.5       6471.5       1.019      1.880
supscal              63.5         39.5         1.608      37.795
MDUinstructions      198.5        191.5        1.037      3.526
BranchJump           85.5         67.5         1.267      21.053
NtoK                 76.5         62.5         1.224      18.301
Table 6 presents the speed-up obtained for the benchmark programs by listing the time
of execution of each benchmark on a non-sliced and a sliced processor and using Eq. (5.2).
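
The speed-up and gain columns can be reproduced from the execution times. In the sketch below the function names are hypothetical, and the gain formula (relative reduction in execution time) is inferred from the tabulated values rather than stated in the text.

```python
def speed_up(t_non_sliced, t_sliced):
    """Speed-up as the ratio of non-sliced to sliced execution time."""
    return t_non_sliced / t_sliced


def gain_percent(t_non_sliced, t_sliced):
    """Reduction in execution time relative to the non-sliced run, in percent."""
    return 100.0 * (t_non_sliced - t_sliced) / t_non_sliced


if __name__ == "__main__":
    # First row of Table 6: 93.5 us non-sliced, 47.5 us sliced.
    print(round(speed_up(93.5, 47.5), 3), round(gain_percent(93.5, 47.5), 3))
```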
5.2 Efficiency
Table 7 presents the efficiency of use of ALU slices. In a non-sliced implementation,
each time the ALU is accessed, both potential slices are accessed. In a sliced ALU, each
time two instructions are executed in parallel, they are assumed to use two slices each,
resulting in entire length of ALU being used.
Let,
$NS_{ALU}$ = Number of times ALU is accessed in non-sliced implementation
$S_{ALU}$ = Number of times ALU is accessed in sliced implementation
$NS_{ALU-Slice}$ = Number of times potential ALU slices are accessed in non-sliced
implementation
$S_{ALU-Slice}$ = Number of ALU slices accessed in sliced implementation
$p$ = Number of times ALU instructions executed in PARALLEL in sliced implementation
$n$ = Total Number of ALU instructions
$NS_{ALU-Slice}$ (Column 5) is given as:
$NS_{ALU-Slice} = NS_{ALU}$ (Column 2) $\times 2$
Also, $S_{ALU-Slice} = 2 \times S_{ALU}$
Table 7 Efficiency

Benchmark Program   # of ALU accesses   # of parallel   # of ALU slices accessed   # of ALU       Efficiency   Gain in
                    NS_ALU    S_ALU     executions      NS_ALU-Slice  S_ALU-Slice  instructions   (sliced)     Efficiency
ALUinstructions-1   22        13        9               44            26           22             0.846        69.23%
ALUinstructions-2   33        31        2               66            62           33             0.532        6.45%
DLX                 9         7         2               18            14           9              0.643        28.57%
LoadStore           5         5         0               10            10           5              0.500        0.00%
PrimeNumber         718       681       37              1436          1362         718            0.527        5.43%
supscal             14        8         6               28            16           14             0.875        75.00%
MDUinstructions     27        26        1               54            52           27             0.519        3.85%
BranchJump          21        14        7               42            28           21             0.750        50.00%
NtoK                16        15        1               32            30           16             0.533        6.67%
Thus, efficiency $\varepsilon$ is given by:

$$\varepsilon_{NS} = \frac{n}{NS_{ALU-Slice}}$$

and

$$\varepsilon_{S} = \frac{n}{S_{ALU-Slice}}$$
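
The efficiency figures of Table 7 can be recomputed from the access counts. The sketch below takes efficiency as the number of ALU instructions divided by the number of slices accessed, with two slices per ALU access, which is the relation the tabulated values follow; the function name efficiency_report is hypothetical, and the gain is taken relative to the non-sliced efficiency.

```python
def efficiency_report(n, ns_alu, s_alu):
    """Return (non-sliced efficiency, sliced efficiency, gain in percent).

    n: total number of ALU instructions.
    ns_alu, s_alu: ALU accesses in the non-sliced and sliced implementations.
    """
    ns_slices = 2 * ns_alu  # both potential slices are touched on each access
    s_slices = 2 * s_alu
    e_ns = n / ns_slices
    e_s = n / s_slices
    gain = 100.0 * (e_s - e_ns) / e_ns
    return e_ns, e_s, gain


if __name__ == "__main__":
    # First row of Table 7: 22 ALU instructions, 22 and 13 ALU accesses.
    e_ns, e_s, gain = efficiency_report(22, 22, 13)
    print(round(e_ns, 3), round(e_s, 3), round(gain, 2))
```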
5.3 Throughput
Table 8 shows the throughput of both implementations in terms of instructions per
fetch cycle.

Table 8 Throughput in terms of Instructions Per Fetch Cycle

Benchmark Program   # of fetch cycles     # of           IPfC                  Gain in
                    non-sliced   sliced   instructions   non-sliced   sliced   IPfC (%)
ALUinstructions-1   23           14       22             0.957        1.571    64.286
ALUinstructions-2   34           32       33             0.971        1.031    6.250
DLX                 27           25       18             0.667        0.720    8.000
LoadStore           30           30       30             1.000        1.000    0.000
PrimeNumber         1693         1660     1321           0.780        0.796    1.988
supscal             17           11       16             0.941        1.455    54.545
MDUinstructions     71           70       39             0.549        0.557    1.429
BranchJump          34           29       28             0.824        0.966    17.241
NtoK                34           33       24             0.706        0.727    3.030
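
The IPfC columns follow directly from Eq. (5.1). As a sketch, with hypothetical function names and the gain taken relative to the non-sliced IPfC (an inference from the tabulated values):

```python
def ipfc(instructions, fetch_cycles):
    """Instructions committed per fetch cycle, as in Eq. (5.1)."""
    return instructions / fetch_cycles


def ipfc_gain_percent(instructions, fetch_non_sliced, fetch_sliced):
    """Relative increase in IPfC of the sliced over the non-sliced run."""
    base = ipfc(instructions, fetch_non_sliced)
    sliced = ipfc(instructions, fetch_sliced)
    return 100.0 * (sliced - base) / base


if __name__ == "__main__":
    # First row of Table 8: 22 instructions, 23 and 14 fetch cycles.
    print(round(ipfc(22, 23), 3), round(ipfc(22, 14), 3),
          round(ipfc_gain_percent(22, 23, 14), 3))
```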
5.4 Power-Delay Product
For estimation of power consumption, the Xilinx XPower tool was used with
synthesizable designs of sliced ALU and non-sliced ALU. The ALU is capable of
performing addition/subtraction, shift, compare and logical operations. Every
combination of two different operations was selected and simulated with worst case
16-bit operands. The operations of addition and comparison were found to consume most
power. The ALU designs were then analyzed for power consumption during execution of
the operations of addition and comparison of 16-bit operands sequentially on a
non-sliced ALU and in parallel on a sliced ALU.
Table 9 shows the power-delay product during this analysis.
Table 9 Power Delay Product for execution of two worst-case operations for two 16-bit operands

                  Power (mW)   Delay (ns)   Power-Delay Product (pJ)
Non-Sliced ALU    431          20           8620
Sliced ALU        604          10           6040
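
The power-delay products in Table 9 are simple products of the two measured quantities; since 1 mW x 1 ns equals 1 pJ, the result is in picojoules. The helper names below are hypothetical, and the relative reduction is a derived figure that does not appear in the table.

```python
def power_delay_product(power_mw, delay_ns):
    """Power-delay product in pJ (mW multiplied by ns)."""
    return power_mw * delay_ns


def pdp_reduction_percent(pdp_non_sliced, pdp_sliced):
    """Relative reduction of the power-delay product, in percent."""
    return 100.0 * (pdp_non_sliced - pdp_sliced) / pdp_non_sliced


if __name__ == "__main__":
    non_sliced = power_delay_product(431, 20)
    sliced = power_delay_product(604, 10)
    print(non_sliced, sliced, round(pdp_reduction_percent(non_sliced, sliced), 2))
```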
Figure 18 shows a snapshot of waveforms simulated on the DLX processor for the ALU
integer benchmark ALUinstructions-Part1, with the fetch registers, commit signals and
ALU issue signals shown. Figure 18(a) shows the simulation run for a DLX processor
with non-sliced ALU and Figure 18(b) shows the simulation run for the DLX processor
with sliced ALU.
Figure 18(a)
[Waveform, time axis 0 to 80 us. Signals: IncomingClock, IF_InstrAddrRegA, IF_InstrAddrRegB, CU_CommitInstrA, CU_CommitInstrB, DP_ExecuteOrIssueInstrA, DP_ExecuteOrIssueInstrB, DP_IssueAluA, DP_IssueAluB, ALU_Issue]
Figure 18(b)
[Waveform, time axis 0 to 55 us. Signals: IncomingClock, IF_InstrAddrRegA, IF_InstrAddrRegB, CU_CommitInstrA, CU_CommitInstrB, ALU_A_Issue, ALU_B_Issue]
Figure 18 Waveforms of simulations for ALUinstructions-Part1.out for DLX processor
with (a) non-sliced ALU and (b) sliced ALU
IF_InstrAddrRegA and IF_InstrAddrRegB are the address registers used by the Instruction
Fetch stage. Each contains the address of an instruction to be fetched. Whenever the
contents of these registers change, the instructions at the addresses stored in the
registers are fetched by the Fetch unit. Thus, at most two instructions (Instruction
A and Instruction B) are fetched at a time. If, in the previous clock cycle, only one
instruction (Instruction A) was able to execute, then the contents of IF_InstrAddrRegB are
transferred to IF_InstrAddrRegA, and IF_InstrAddrRegB fetches a new instruction. Both
fetched instructions are then decoded in the Decode Unit. An instruction is ready to
execute when all its operands have been fetched. Ready instructions are issued to execution
units when units are available. Issue to the ALU is signaled by the ALU_Issue signal. If
instruction A is to be issued, then DP_IssueAluA is high, and if instruction B is to be
issued, then DP_IssueAluB is high. The two signals cannot be high simultaneously for a
processor with a non-sliced ALU. However, if the nature of the operands allows it, both
signals will be high simultaneously for a processor with a sliced ALU. After instructions
are executed, their results are stored in destination registers and each instruction is
marked for retirement using the commit signals CU_CommitInstrA and
CU_CommitInstrB. The two-wide pipeline is capable of committing at most two
instructions at a time.
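The fetch-register hand-off and the ALU issue rule described above can be sketched as a
small behavioural model. This is only an illustrative Python sketch, not the VHDL of the
thesis: the function names, the 4-byte sequential refill (no branches), the hold behaviour
when Instruction A does not execute, and the rule that Instruction A wins when only one
instruction can issue are all assumptions made for the example.

```python
from typing import Tuple

INSTR_BYTES = 4  # assumption: fixed 32-bit instructions, sequential code, no branches


def next_fetch_addresses(reg_a: int, reg_b: int,
                         executed_a: bool, executed_b: bool) -> Tuple[int, int]:
    """Return the next (IF_InstrAddrRegA, IF_InstrAddrRegB) contents."""
    if executed_a and executed_b:
        # Assumed: both slots drained, so both registers take new addresses.
        return reg_b + INSTR_BYTES, reg_b + 2 * INSTR_BYTES
    if executed_a:
        # From the text: B's address moves into A and B fetches a new instruction.
        return reg_b, reg_b + INSTR_BYTES
    # Assumed: Instruction A did not execute, so both registers hold (in-order model).
    return reg_a, reg_b


def alu_issue(ready_a: bool, ready_b: bool,
              sliced: bool, operands_allow_slicing: bool) -> Tuple[bool, bool]:
    """Return (issue A, issue B) for one cycle on a single ALU."""
    if ready_a and ready_b:
        if sliced and operands_allow_slicing:
            return True, True   # sliced ALU: both issue signals high together
        return True, False      # assumed: Instruction A has priority
    return ready_a, ready_b


print(next_fetch_addresses(0, 4, True, False))
print(alu_issue(True, True, sliced=False, operands_allow_slicing=True))
print(alu_issue(True, True, sliced=True, operands_allow_slicing=True))
```

In this model the non-sliced configuration never raises both issue signals in the same
cycle, while the sliced configuration does so only when the operands permit it.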
The fetch address registers are loaded after an instruction is marked for issue by
the dispatch unit. Thus, an instruction is fetched, decoded and dispatched in the [...]
From Figure 18(b), it can be seen that there are nine instances when both ALU_A_Issue and
ALU_B_Issue were high, which indicates that during nine cycles, the ALU instructions
present in the reservation units were executed simultaneously.
CU_CommitInstrA and CU_CommitInstrB committed two instructions at a time in nine
instances in the sliced ALU processor, while they committed at most one instruction at a
time in the non-sliced ALU processor.
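Counts such as the number of simultaneous issues or commits quoted above are obtained by
checking, cycle by cycle, whether both signals of a pair are high. A minimal sketch of
that tally follows. The traces are hypothetical per-cycle samples invented for
illustration, not the data behind Figure 18, and the function names are likewise
assumptions.

```python
from typing import Sequence


def count_dual_cycles(sig_a: Sequence[int], sig_b: Sequence[int]) -> int:
    """Count the cycles in which both per-cycle signals are high."""
    if len(sig_a) != len(sig_b):
        raise ValueError("traces must cover the same number of cycles")
    return sum(1 for a, b in zip(sig_a, sig_b) if a and b)


def total_events(sig_a: Sequence[int], sig_b: Sequence[int]) -> int:
    """Count every high sample on either signal (e.g. instructions issued)."""
    return sum(1 for a in sig_a if a) + sum(1 for b in sig_b if b)


# Hypothetical 6-cycle traces, one sample per clock cycle.
alu_a_issue = [1, 1, 0, 1, 1, 0]
alu_b_issue = [0, 1, 0, 1, 0, 0]

print(count_dual_cycles(alu_a_issue, alu_b_issue))  # 2 cycles with both signals high
print(total_events(alu_a_issue, alu_b_issue))       # 6 issues in total
```

The same tally applies to the commit pair CU_CommitInstrA and CU_CommitInstrB.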
The simulation waveforms show that slicing extracts instruction parallelism present
in the program and reduces instruction stalls that occur due to resource bottlenecks.
The gain percentages presented in the last columns of all result tables give an indication
of the unresolved parallelism present in programs that was extracted only after slicing.
CHAPTER 6 
CONCLUSIONS AND FUTURE WORK
6.1 Conclusions
The concept of resource slicing was implemented in the DLX processor using VHDL.
Sliced resources process a greater number of instructions without the need to add extra
hardware resources. The sliced resource implementation was evaluated with respect to
speed-up, throughput, power and utilization of the integer unit.
From the results thus obtained, it can be observed that by the addition of one
low-latency stage (Resource Mapping) and minimal hardware, it is possible to obtain a
speed-up and higher efficiency of execution. The number of functional units required to
be pipelined in a superscalar pipeline can also be reduced if the task running on the
processor allows it. For a generic processor that runs a variety of different applications,
each requiring a different number of functional units, this can provide a flexible scheme
for efficient execution.
The Intel MMX architecture was also developed with the purpose of parallelizing
execution of instructions on data narrower than the word size of the processor.
The sliced architecture, if evaluated with MMX-type data, will also perform similarly.
Unlike MMX, the sliced architecture will not require additional MMX-specific
instructions and will dynamically slice itself into multiple modules to process the data.
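The idea of one wide unit serving several narrow operations can be illustrated in
software by an adder whose carry chain is broken between slices. The sketch below is a
Python analogue only, not the Resource Mapping hardware of the thesis: the 32-bit word,
the two unsigned 16-bit slices, and all names are assumptions chosen for the example.

```python
from typing import Tuple

LANE_BITS = 16                    # assumption: two 16-bit slices of a 32-bit ALU
LANE_MASK = (1 << LANE_BITS) - 1
HIGH_BITS = 0x80008000            # top bit of each slice
LOW_BITS = 0x7FFF7FFF             # remaining bits of each slice


def fits_in_slice(*operands: int) -> bool:
    """Dynamic check: can every operand be handled by a single slice?"""
    return all(0 <= x <= LANE_MASK for x in operands)


def pack(hi: int, lo: int) -> int:
    """Place two narrow operands into the upper and lower slice of one word."""
    return ((hi & LANE_MASK) << LANE_BITS) | (lo & LANE_MASK)


def unpack(word: int) -> Tuple[int, int]:
    """Split a word back into its (upper, lower) slice values."""
    return (word >> LANE_BITS) & LANE_MASK, word & LANE_MASK


def sliced_add(x: int, y: int) -> int:
    """Add two packed words slice by slice; no carry crosses the slice boundary."""
    partial = (x & LOW_BITS) + (y & LOW_BITS)  # carries stop at each slice's top bit
    return partial ^ ((x ^ y) & HIGH_BITS)     # add the top bits, discarding carry-out


# Two independent additions, 1 + 2 and 0xFFFF + 1, performed in one pass.
result = sliced_add(pack(1, 0xFFFF), pack(2, 1))
print(unpack(result))  # (3, 0): the lower slice wraps without disturbing the upper one
```

Here `fits_in_slice` stands in for the run-time decision to slice, whereas MMX requires
the program to select packed instructions explicitly.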
6.2 Future Work
It is necessary to evaluate the performance enhancement obtained at varying
superscalar widths on more benchmarks than used here. This will help in determining
the optimal number of slices required for different applications. This number can then
be used to design sliced processors for maximum efficiency.
Block slicing is a general concept that can be applied in a variety of forms to
modules other than functional units. It may be applied to registers and caches. Suitable
hardware must be designed to address, identify and access sliced data when it is
stored in sliced registers and caches. A complete sliced processor will be obtained once
these modules have also been sliced.
VITA
Graduate College
University of Nevada, Las Vegas
Shruti Ravikant Patil
Home Address:
969 East Flamingo Road Apt# 101
Las Vegas, Nevada 89119
Degree:
Bachelor of Engineering, Computer Engineering, 2004
University of Mumbai
Special Honors and Awards:
Member, Tau Beta Pi, Nevada Chapter, Initiated in Fall 2005
Recipient of the National Talent Search Scholarship (India), 1998
Rank 6, Regional Mathematics Olympiad, 1998, Mumbai
Recipient of Bombay Talent Search Award, 1997
Publications:
“HAUNT-24: Hierarchical, Application Confined Unique Naming Technique”, IEEE
Conference Proceedings of Fifth International Conference on Intelligent Systems
Design and Applications (ISDA 2005), 8-10 Sept. 2005, Poland.
“Simultaneous Column Minimization-Encoding approach for Serial Decomposition”,
IEEE Proceedings of International Conference on Computational Intelligence and
Multimedia Applications Conference (ICCIMA), August 2005, Las Vegas, NV, USA
“Branch Prediction by Checking Loop Terminal Conditions”, Information Systems:
New Generations (ISNG) Conference Proceedings, April 2005, Las Vegas, NV, USA
“Neutron Detector Characteristics in Dead Time Experiments”, presented at 2005
American Nuclear Society Student Conference, April 15, 2005. Awarded the second
prize as Best Student Paper Presentation.
Thesis Title: Maximizing Resource Utilization By Slicing Of Superscalar Architecture
Thesis Examination Committee:
Chairperson, Dr. Venkatesan Muthukumar, Ph. D.
Committee Member, Dr. Emma Regentova, Ph. D.
Committee Member, Dr. Shahram Latifi, Ph. D.
Graduate Faculty Representative, Dr. Ajoy Datta, Ph. D.
