A Case for a Complexity-Effective, Width-partitioned Microarchitecture by Rochecouste, Olivier et al.
HAL Id: inria-00000211
https://hal.inria.fr/inria-00000211
Submitted on 13 Sep 2005
A Case for a Complexity-Effective, Width-partitioned
Microarchitecture
Olivier Rochecouste, Gilles Pokam, André Seznec
To cite this version:
Olivier Rochecouste, Gilles Pokam, André Seznec. A Case for a Complexity-Effective, Width-partitioned Microarchitecture. [Research Report] PI 1742, 2005, pp.27. inria-00000211
IRISA (Institut de Recherche en Informatique et Systèmes Aléatoires), Publication Interne No 1742, ISSN 1166-8687
INSTITUT DE RECHERCHE EN INFORMATIQUE ET SYSTÈMES ALÉATOIRES
Campus de Beaulieu – 35042 Rennes Cedex – France
Tél. : (33) 02 99 84 71 00 – Fax : (33) 02 99 84 71 71
http://www.irisa.fr
A Case for a Complexity-Effective, Width-partitioned Microarchitecture

Olivier Rochecouste*, Gilles Pokam**, André Seznec***

Communicating Systems, CAPS Project
Internal publication no. 1742, August 2005, 27 pages

Abstract: Current superscalar processors feature 64-bit datapaths to execute the program instructions, regardless of the size of their operands. Our analysis indicates, however, that most executions comprise a large amount (40%) of narrow-width operations, i.e. instructions which exclusively process narrow-width operands and results. We further noticed that these operations are well distributed across a program run. In this paper, we exploit these properties to master the hardware complexity of superscalar processors. We propose a width-partitioned microarchitecture (WPM) to decouple the treatment of narrow-width operations from that of the other program instructions. We split a 4-way issue processor into two clusters: one executing 64-bit operations, load/store and complex operations, and the other treating the 16-bit operations. We show that revealing the narrow-width operations to the hardware is sufficient to keep the workload balanced and the communications minimized between clusters. Using a WPM reduces the complexity of several critical processor components: register file and bypass network. A WPM also lowers the complexity of the interconnection fabric since the 16-bit cluster is only able to propagate narrow-width data. We examine simple configurations of WPM while discussing their tradeoffs. We evaluate a speculative heuristic to steer the narrow-width operations towards clusters. A detailed complexity analysis shows that using a WPM model saves power and area with a minimal impact on performance.

Key-words: Hardware complexity, power consumption, superscalar processor, width-partitioned microarchitecture, register file, narrow-width operations, data-width predictor

* orocheco@irisa.fr
** gpokam@cs.ucsd.edu
*** seznec@irisa.fr
Centre National de la Recherche Scientifique (UMR 6074), Université de Rennes 1, Insa de Rennes
Institut National de Recherche en Informatique et en Automatique, unité de recherche de Rennes
1 Introduction

Increase in processor performance is strongly correlated to exploiting large ILP and sustaining fast clock rates. Although processor designers have been able to keep up with this performance growth during the past few decades, it now becomes increasingly challenging to do so without overcoming several major obstacles. Among these, the impact of wire delays [12] is expected to prevail as device features become smaller. Other major factors that preclude yielding higher clock frequencies include the physical register file, the bypass network, the wakeup and the selection logic [21]. In addition, as the complexity of these structures increases dramatically with larger issue widths, the impact on power consumption and area also appears to be a serious matter.

Researchers have proposed a large number of solutions to overcome the aforementioned issues. Some studies have considered partitioning the hardware resources into clusters of computational units to reduce the overall complexity [21, 9, 1]. In these studies, the partitioning is dictated by the need to break the complexity growth factor of the critical components by reducing their sizes. Hence, the resulting clusters have simpler structures, thereby enabling fast clock rates. However, a major bottleneck with this approach is the interconnect fabric used to communicate data between clusters. This interconnect fabric is relatively slow and dissipates a large amount of power [19]. It is therefore desirable to minimize the number of inter-cluster communications while also keeping the workload among clusters balanced.

Other studies have considered a careful design of the critical processor components to reduce this complexity. These studies are mainly directed by empirical analysis made on runtime data, such as the seminal observation made by Brooks et al. [4] that most applications only need part of the full datapath-width to execute.
Several optimizations have been proposed which exploit this narrow-width operand property of programs to reduce power consumption [4, 24, 5, 14] or to improve performance [8, 17, 25, 18, 20]. While they do help reduce the complexity of certain critical processor components (e.g. the register file), quantifying their impact on the overall microarchitecture is more difficult. This is because many of these proposals feature complex implementations, sometimes requiring major changes to the hardware.

This paper proposes to make efficient use of the available silicon by exploring new possibilities of partitioning a microarchitecture based on narrow-width data. Central to our approach is the observation that the occurrence of narrow-width operations, i.e. instructions exclusively comprising narrow-width operands, and the other program instructions is relatively balanced and highly interleaved across a complete program run. We observed this program property on the MediaBench and SPEC2000 benchmarks. Because of the relative prevalence of these narrow-width operations in programs (about 40% of the instructions exhibit this property for the considered benchmarks), we suggest using a width-partitioned microarchitecture (WPM) to master the hardware complexity of superscalar processors. In a WPM, we resort to partitioning to decouple the treatment of the narrow-width operations from that of the other program instructions. This provides the benefit of greatly simplifying the design of the critical processor components in each cluster (e.g. the register file) as no additional hardware is required for managing each type of instruction; yet, the interleaving of the two instruction types balances the workload among the clusters. We also show that WPM reduces the complexity of the interconnect fabric. In fact, since clusters with a narrow-width datapath can only communicate narrow-width data, the datapath-width of the interconnect fabric is significantly reduced, yielding a corresponding saving of the interconnect power and area. We present an efficient design of WPM, discussing various implementation choices including steering heuristics to distribute instructions among the clusters and a detailed analysis of the complexity factors affecting the performance, power and area. Our complexity analysis shows that using a WPM architecture instead of a classical 64-bit 2-cluster microarchitecture can indeed save power and silicon area with only a minimal impact on the overall performance.

The remainder of this paper is organized as follows. Section 2 elaborates on the motivations of this work, providing some intuitive observations about the rationale of our approach. WPMs are described in detail in Section 3, while their complexity analysis is discussed in Section 4. The instruction steering mechanism is presented in Section 5.2. Results are presented in Section 6, while Section 7 discusses the related work. We conclude in Section 8.

2 Motivations

In recent works, several authors [4, 18, 24, 8] have pointed out the large availability of narrow-width data within compute-intensive integer and multimedia programs. To exploit this program property, various definitions of the operations executing with narrow-width operands have been assumed, depending on their application to the architecture. Brooks et al. [4] have qualified a narrow-width operation as an instance where both source operands can be represented with fewer than 16 bits, whereas Pokam et al. [24] considered the basic-block granularity to define narrow-width regions in a program.
We formulate a different assumption that considers a narrow-width operation to be an operation where no operand exceeds 16 bits, including the destination operand.

Characterizing narrow-width operations. We have quantified the number of occurrences of these narrow-width operations across the MediaBench and the SPEC2000 benchmarks. Our bitwidth analysis is exclusively devoted to operations processed through the integer functional unit, including the address calculation. Operations that execute with narrow-width operands in the two's complement form are also considered. Figure 1 reports the classification and the distribution of the integer operations using narrow-width operands. As a convention, we note N for a narrow-width operand and F for a full-width operand. We use a 3-letter notation for categorizing an operation: the two leading letters represent the width type of the source operands and the last letter is the width type of the result. For instance, NFN stands for an operation which processes a narrow-width and a full-width source operand and produces a narrow-width result. For the monadic operations, we consider that all of the source operands feature the same width.
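The three-letter notation can be illustrated with a short sketch. It is not the profiling tool used in this study. It assumes that a 64-bit value counts as narrow when its upper 48 bits are all zeros or all ones (a zero-extended or sign-extended 16-bit pattern), which is only one possible reading of the two's complement remark above, and that source letters are listed with N before F, as in the category names used in the text. The names `is_narrow` and `classify` are hypothetical.

```python
MASK64 = (1 << 64) - 1          # model register values as 64-bit patterns
UPPER48 = MASK64 ^ 0xFFFF       # bits 16..63 of a 64-bit value


def is_narrow(value: int) -> bool:
    """Assumed test: the upper 48 bits are all 0s or all 1s."""
    upper = (value & MASK64) & UPPER48
    return upper == 0 or upper == UPPER48


def classify(sources, result) -> str:
    """Return a 3-letter category such as 'NNN' or 'NFN'."""
    letters = ["N" if is_narrow(s) else "F" for s in sources]
    if len(letters) == 1:        # monadic: both source letters are the same
        letters = letters * 2
    letters.sort(reverse=True)   # reverse alphabetical order puts 'N' before 'F'
    return "".join(letters) + ("N" if is_narrow(result) else "F")


print(classify((3, 5), 8))             # NNN
print(classify((1 << 20, 4), 4))       # NFN
print(classify((-1,), -2))             # NNN (small negative values)
print(classify((0xFFFF, 1), 0x10000))  # NNF (result overflows 16 bits)
```

Under this assumed test, only operations classified as NNN would qualify as narrow-width operations in the sense defined above.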
Figure 1: Classification and distribution of integer operations using narrow-width data.

We observe from Figure 1 that a significant part of the integer execution, about 40%, is devoted to the narrow-width operations (NNN). These results corroborate the prior observations made by Brooks et al. [4] regarding the prevalence of narrow-width data for the integer operations. This suggests a scheme that decouples the processing of narrow-width operations onto dedicated narrow operators. This would significantly reduce the complexity of certain processor components lying on the critical path. In this study, we advocate decoupling the processing of narrow-width operations onto dedicated narrow-width clusters, where one or more clusters feature a narrow datapath-width. We refer to such a partitioned model as a width-partitioned microarchitecture (WPM). As for a conventional partitioned architecture, a WPM calls for a proper steering mechanism to distribute the narrow-width operations. It is crucial for both performance and power that the steering heuristics balance the workload among clusters while minimizing inter-cluster communications.

Inter-cluster communications. Figure 1 also provides an estimate of the average number of communications that take place within a WPM. An inter-cluster communication in WPM can be triggered if an operation consumes a narrow-width value produced in a remote cluster (e.g. NNF, NFN, NFF) or if it produces a narrow-width value that must be propagated to a remote cluster (e.g. NFN, FFN). As shown in Figure 1, this concerns roughly 20% of the integer operations. For a WPM, this might translate into the worst-case scenario where one operation out of five triggers an inter-cluster communication. However, this is an upper bound: the actual number depends on the distribution of the narrow-width operations and on the data dependencies among operations, i.e. not all the narrow-width operations have a data dependency with the other, larger-width program instructions.
Our results section indeed shows that the number of inter-cluster communications is far below this bound.
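The 20% figure can be read as the combined share of the categories that mix narrow and full widths. The sketch below shows that bookkeeping under the assumption that a histogram of category counts is available. The counts are hypothetical placeholders chosen only for illustration; they are not the values of Figure 1, and `communication_upper_bound` is a hypothetical name.

```python
# Categories that may trigger an inter-cluster communication, as listed in the text.
COMMUNICATING = {"NNF", "NFN", "NFF", "FFN"}


def communication_upper_bound(histogram):
    """Fraction of operations that could, at worst, cause a communication."""
    total = sum(histogram.values())
    mixed = sum(n for category, n in histogram.items() if category in COMMUNICATING)
    return mixed / total


# Hypothetical operation counts, NOT measured data.
example = {"NNN": 40, "NNF": 5, "NFN": 2, "NFF": 12, "FFN": 1, "FFF": 40}
print(communication_upper_bound(example))  # 0.2, i.e. one operation out of five
```

The bound ignores whether a dependent operation actually exists in the other cluster, which is why the communication rate measured at runtime can be much lower.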
Figure 2: Distance between narrow-width operations and the other program instructions at runtime.

Workload balance. Another relevant task for the instruction steering mechanism is to guarantee a good workload balance among clusters. We have approximated the workload balance that a WPM might be subject to as follows. For each narrow-width operation, we have collected the distance separating it from the next operation that executes with larger data. Figure 2 displays the mean of the most frequent distances observed over all benchmark applications at runtime. The standard deviation across applications is also reported and reveals the strong correlation of narrow-width distribution between applications. Another phenomenon illustrated in Figure 2 is the dominance of short distances at runtime. This may be due to the fact that we also included address calculations, which frequently use the full datapath-width. This might therefore mean that occurrences of narrow-width operations are highly interleaved with the other operations in program execution. From a WPM viewpoint, this means that a simple steering heuristic may be able to achieve a balanced workload.

3 Width-partitioned Microarchitecture

Most integer and multimedia applications exhibit a large fraction of narrow-width operations that are also well distributed across the execution. To take advantage of this program property, we examine a novel partitioned architecture that can efficiently operate on narrow-width operations as well as on the other program instructions, with reduced complexity. We refer to this novel partitioned organization as width-partitioned microarchitecture (WPM). This section describes the implementation of such a 4-way WPM design. One can easily consider scaling up this design to a larger issue-width. Doing so will require some modifications to the inter-cluster communication model. This is however beyond the scope of this paper.
Figure 3: Baseline organization
Figure 4: WPM organization

3.1 Baseline model

Our baseline model is derived from the Alpha 21264 [13]. It is a 64-bit, out-of-order, dual-cluster machine. We assume that the floating-point operations are processed in a dedicated cluster not described in this paper. Figure 3 shows the block diagram of this baseline organization. As depicted in the figure, the processor front-end (fetch, decode and rename) and the data cache are shared by all clusters. Similarly to the Alpha 21264 [13], we assume that the issue queues are decoupled from the reorder buffer and partitioned among clusters. The other components comprise the functional units and the register file, which is duplicated onto each cluster. Both clusters are capable of issuing up to two instructions per cycle. Every 64-bit ALU can treat complex instructions such as multiplication or shift operations. We assume that the scheduling of memory operations is restricted to a single cluster. In addition, we consider that the load/store unit is capable of processing integer and logic operations. Since we examine a dual-cluster implementation, a fully-connected topology is advocated to circumvent potential resource contention and maximize performance. For supporting this topology, each register file (RF) copy must feature a number of write ports equal to the total number of ALUs, i.e. 4 write ports per cluster.

Once fetched and decoded, instructions are processed by the renaming stage. At this step, the steering logic is responsible for dispatching the instructions to the proper cluster. We rely upon an instruction steering heuristic similar to [7], which steers an instruction to the cluster that produces most of its operands if this cluster comprises the proper functional unit. An instruction can only access its source operands from the local RF. We assume that inter-cluster communications are implicitly done by propagating every result to the local and the remote RF.
For the producing cluster, data are bypassed in the same cycle to allow back-to-back executions, whereas broadcasting data to the other cluster takes additional cycles.

Figure 5: Inter-cluster communication scenario. The suffix indicates the width of the operand.

3.2 WPM design

The basic WPM design considered in this study splits the integer core into two distinct clusters: (1) a main full-width cluster featuring a 64-bit datapath and (2) a narrow-width cluster featuring a 16-bit datapath. As in the baseline model, each cluster is composed of a set of functional units (FUs) and a local RF. The narrow-width cluster features two 16-bit ALUs and a local 16-bit RF (called narrow-width RF). As shown in Section 4, this organization dramatically reduces the overall processor complexity as there is no need for additional hardware to keep track of the different datapath-width execution modes. The narrow-width RF has four read ports and two write ports to provide support for the execution of two operations per cycle. The full-width cluster, on the other hand, comprises a 64-bit ALU and one load/store unit capable of executing simple arithmetic and logic operations. A 64-bit local RF (called full-width RF) is provided with four read ports and two write ports to support the execution of two 64-bit operations per cycle. Restricting the load/store unit to the full-width cluster is coherent with our approach since address calculations generally operate on the full datapath-width. Figure 4 illustrates this basic WPM organization. Similar to the baseline model, we do not consider partitioning the processor front-end and the data cache. We do however need to address with care the communication between the narrow-width cluster and the full-width cluster.

3.2.1 Inter-cluster communications

The need to communicate data between the narrow-width cluster and the full-width cluster is dictated by the propagation of data dependencies among the narrow-width operations and the other program instructions. Consider for instance the execution scenario depicted in Figure 5. Operation IN0 on the narrow-width cluster produces a value that is later needed by operation IF1 executing on the full-width cluster. The result of this operation is then consumed on the narrow-width cluster by operation IN2. These edges actually label the data dependencies between the operations. The number of such edges is the cut-size between the set of narrow-width operations and the other program instructions and is actually a maximum bound on the total amount of inter-cluster communications, e.g. three communications in the given example.

A first naive implementation is to make each FU on each cluster be write-connected to the RF of the remote cluster. This will add four additional write ports on each RF: two for the 64-bit ALU and the load/store unit, and two others for the two 16-bit ALUs. Obviously, this is detrimental for the performance and the power consumption considering the fact that potentially 20% of the operations in the full-width cluster will be contributing to the inter-cluster communications (see Figure 1). The 16-bit duplicate RF shown in Figure 4 has been specifically designed to break down this complexity. This RF provides a copy of the narrow-width RF and is kept synchronized with it by the functional units in both clusters.

Narrow-width cluster implementation details. Regarding the narrow-width cluster, two write ports are provided by the 16-bit duplicate RF to allow the 16-bit ALUs to keep each writeback register up to date with its copy in the remote cluster. There are two reasons to maintain the ALUs in the narrow-width cluster fully-connected with the remote RF copy. First, as shown in Figure 1, the availability of narrow-width operations in programs is large enough to justify the need for more communication bandwidth between the narrow-width cluster and the full-width cluster.
Seond, Figure 1 evidenes the fat that amongthe operations that may potentially involve a remote ommuniation with the narrow-widthluster, the NFF operations are by far the largest. A NFF operation may onsume itsvalue from the 16-bit dupliate RF, meaning that the opy must have been kept up to dateby the narrow-width luster.Full-width luster implementation details In the full-width luster, two read portsand two write ports are provided by the 16-bit dupliate RF to allow the 64-bit ALU and theload/store unit to read and to write bak their result. However, only one write port is atu-ally onneted to the remote RF opy. This latter is motivated by the observation that only asmall fration of the operations exeuting on the full-width luster needs to be synhronizedwith their remote opy. In fat, these operations are restrited to the subset of instrutionsthat produe a 16-bit result, e.g. IF3 and IF1 in Figure 5. As illustrated in Figure 1 withFFN and NFN, their representativeness in programs is negligible; there is therefore noneed to provide the full write bandwidth to keep both opies synhronized. Moreover, ouranalysis showed that among those instrutions that may involve a remote ommuniationwith the narrow-width luster, a large perentage of these are atually narrow-width loads.This explains the additional port on the narrow-width RF whih is write-onneted withthe load/store unit in the full-width luster. The broadast to the remote RF opy is doneeah time the load/store unit writes to the 16-bit dupliate RF.
10 Roheouste, Pokam & SezneSine only the load/store unit maintains both RF opies synhronized with eah other, itis possible that a value being written bak by the 64-bit ALU in the 16-bit dupliate RFis not available in the narrow-width RF when a dependent narrow-width operation is readyto issue. In that ase, we assume the hardware automatially inserts a opy instrution toforward that value to the RF opy [22℄. Note however that this ase is rare sine the onlyoperations that may potentially ommuniate their result to the remote narrow-width lusterare FFN and NFN. These operations ontribute for less than 3% on our benhmarks (seeFigure 1). It is also important to note that only narrow-width operations an be exeutedon the narrow-width luster. The other types of operations ontaining a narrow-width data,i.e. FFN, NNF, NFN and NFF, exeute on the full-width luster and read/write theirnarrow-width data from/to the 16-bit dupliate RF. The 16-bit dupliate RF therefore servesboth as a opy of the narrow-width RF and also a loal 16-bit RF sine many values may beread and written into it without atually modifying their opy. We show indeed in Setion4 that this atually signiantly redues the omplexity.3.2.2 Limited inter-luster onnetivityWe also explored a sheme with limited inter-luster onnetivity to further mitigate overallomplexity. In this new organization, we redue the number of write ports on the 16-bitdupliate RF from 4 to 2. In the narrow-width luster, we remove the path labelled 2 inFigure 4, meaning that only one 16-bit ALU is now able to propagate its result to the remoteRF opy. In the full-width luster, we note that there is no need to provide 2 write portson the 16-bit dupliate RF as maintaining it synhronized with the opy is done by theload/store unit. Moreover, our analysis shows that there are only a few operations (NFNand FFN) exeuting on the main luster that produe narrow-width results. 
Hene, itmakes sense to remove the path labelled 1 in Figure 4, but operations produing a narrow-width data will now have to be steered toward the load/store unit. Note however that the64-bit ALU an still exeute operations with narrow-width data. If the operation produesa narrow-width result, this result will have to be written bak to the 64-bit RF. Albeit amore eient use of omputing resoures an be realized by doing so, it should be notiedthat we may however miss some optimization opportunities.We also propose to optimize the number of inter-luster ommuniations as a smallfration of the integer operations exeuting on the full-width luster use and produe narrow-width data. For this purpose, we advoate using a opy instrution sheme [22℄ to updatethe ontent of a 16-bit register only when neessary. This approah an lead to signiantpower savings in the interonnet fabri and register les. Nevertheless, using this approahmay also have a negative impat on the overall performane as the onsuming operations willbe delayed until the opy instrutions write their results bak. To mitigate the performanedegradation, we propose to broadast the value of load operations as done in the basi WPM.It makes sense to do so as we observed that load operations whih produe narrow-widthdata are relatively frequent at runtime. However, a more eient optimization would berelated to the use of a narrow-width usage preditor to predit on whih luster a value willlikely be onsumed. This sheme ould be very eetive in both reduing the number ofIrisa
WPM 11ommuniations and improving performane while also eluding the needs of a opy operationsheme. Examining this approah is however left for future researh.4 Complexity AnalysisIn this setion, we aim to ompare the implementation omplexity of the baseline partitionedproessor presented in Setion 3 with the WPM. We onsider two implementations of WPMfor the omparison. The rst one is alled WPM basi. It orresponds to the basi WPMonguration desribed in Setion 3.2.1. The seond implementation is alled WPM limited.It redues the number of write ports on the 16-bit dupliate RF of WPM basi to 2 (seeSetion 3.2.2). The omparison will mainly emphasize the omplexity-eetiveness of thefollowing main proessor strutures: the register le, the bypass network, the wakeup andselet logi, and the interonnet.4.1 Register leThe omplexity of the register le is mainly haraterized by three fators: the area, theaess time and the power onsumption. For the last two points, we based our estimationson CACTI [28℄, whih we modied appropriately to model a register le1. For all the resultspresented in this setion, we assume a 0.13µm CMOS proess tehnology for the registerell implementation.Silion area For a onventional multi-ported register le featuring Nread ports and Nwriteports, a total of Nread bitlines, Nread wordline, 2×Nwrite bitlines along with Nwrite wordlinewires must ross eah memory ell. Equation (1) depits the silion area that is typiallydevoted to a physial register featuring Nregs registers omprised of Rwidth bits eah. In thegiven equation, ω denotes the width of a wire [29℄.
$$N_{regs} \times R_{width} \times \underbrace{\omega^{2} \times (N_{read} + N_{write}) \times (N_{read} + 2 \times N_{write})}_{\text{cell size}} \tag{1}$$

Equation (1) shows that the area devoted to the register file is the product of the number of registers, N_regs, the number of bits per register, R_width, and the size of a memory cell. The area thus increases linearly with the number of bits and the size of the register file, whereas it grows quadratically with the number of read/write ports, since both dimensions of a cell scale with the port count. In WPM, the treatment of narrow-width operations is decoupled from that of the other program instructions, yielding a dramatic reduction of R_width. This yields a significant area reduction (about 81%) as shown in Table 1, i.e. cluster 1 in WPM basic and WPM limited. In the full-width cluster, the number of write ports on the 64-bit register file is halved. The total area reduction of the RFs in the main cluster (cluster 0 in Table 1) is about 34% for the basic WPM and 43% for the limited-connectivity WPM.

Footnote 1: The tag path has been omitted.
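As a sanity check, Equation (1) can be evaluated for the register file configurations listed in Table 1 (80 registers, with the read/write port counts given there). The sketch below is only an illustration of the arithmetic, with areas expressed in units of ω². The function names are hypothetical, and percentages are truncated to whole numbers, which appears to be how the reductions in Table 1 are reported.

```python
def rf_area(n_regs, r_width, n_read, n_write):
    """Register file area in units of omega^2, following Equation (1)."""
    cell_size = (n_read + n_write) * (n_read + 2 * n_write)
    return n_regs * r_width * cell_size


def reduction_pct(new, old):
    """Area reduction relative to `old`, truncated to an integer percentage."""
    return int(100 * (old - new) / old)


conventional = rf_area(80, 64, 4, 4)                            # one baseline cluster
wpm_basic_c0 = rf_area(80, 64, 4, 2) + rf_area(80, 16, 2, 4)    # 64-bit RF + 16-bit duplicate RF
wpm_limited_c0 = rf_area(80, 64, 4, 2) + rf_area(80, 16, 2, 2)  # duplicate RF with 2 write ports
narrow_c1 = rf_area(80, 16, 4, 3)                               # narrow-width RF

print(conventional)                                 # 491520
print(wpm_basic_c0, wpm_limited_c0, narrow_c1)      # 322560 276480 89600
print(reduction_pct(wpm_basic_c0, conventional))    # 34
print(reduction_pct(wpm_limited_c0, conventional))  # 43
print(reduction_pct(narrow_c1, conventional))       # 81
```

The 64-bit RF alone already drops from 491520 to 245760 when its write ports go from 4 to 2, which shows how strongly the port count weighs on the cell size.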
12 Roheouste, Pokam & SezneConventional WPM basi WPM limitedluster 0 1 0 1 0 1RF 1 1 2 1 2 1reg. width (in bits) 64 64 64/16 16 64/16 16nb of registers 80 80 80/80 80 80/80 80(R,W) ports (4,4) (4,4) (4,2)/(2,4) (4,3) (4,2)/(2,2) (4,3)RF area (x ω2) 491520 491520 245760 + 76800 89600 245760 + 30720 89600Area redution - - 34% 81% 43% 81%RF aess time 0.6326 0.6326 0.6000/0.5278 0.5342 0.6000/0.4916 0.5342Aess time redution - - min(6%,16%) 15% min(6%,22%) 15%RF nJ/aess 0.5431 0.5431 0.4267/0.1977 0.3500 0.4267/0.1571 0.3500Energy redution - - 21%/63% 35% 21%/71% 35%Table 1: Estimate of RF omplexity. Energy onsumption is average energy per read/writeaess.Aess time The aess time of a register le is mainly dominated by the wire propagationdelay. As the size and the area of the register le inrease, signals propagate along long wires,resulting into larger propagation delays. Sine WPM redues the width of the narrow-widthregister le by almost a fator of four, shorter word-lines are required to propagate signals.In the narrow-width luster, this translates into signiant aess time redution omparedwith the onventional proessor, about 15% as evidened in Table 1. In the full-width luster,the register le aess time is dominated by the aess time of the 64-bit register le. Sinethe number of write ports on that register le has been halved, wires length is redued,resulting into smaller aess time (6% less than the baseline model). Therefore, WPM stillproves to be more omplexity-eetive as ompared to the onventional proessor.Power onsumption The register le layout as well as the number of ports attahed toit have a dramati impat on the power onsumption. On eah register le aess, one ormore word-lines go high, while all bit-lines are preharged and sensed in order to determinethe state of the attahed register ells. In WPM, with the bitwidth size redution, onlya small fration of these bit-lines are driven. 
This yields a significant energy reduction of the narrow-width RFs, about 35% as shown in Table 1. The wire length increases with the number of ports, raising the wire capacitance significantly. With WPM, the number of write ports on the 64-bit RF is halved. Hence, the wire capacitance is reduced, explaining the energy savings of the RFs in the full-width cluster (21% to 71%) as shown in Table 1. As a result, WPM consumes less energy on a register file access than the conventional model, since the energy per access is lower in all cases.

4.2 Bypass network

Bypassing allows the result of an operation to be consumed by another dependent operation before it gets written to the register file. The complexity of the bypass network is dominated by the number of bypass paths and the time required for a value to be propagated along each of these paths. In our basic WPM design, any operation on the full-width cluster can have its input operand coming from one of six sources: the 64-bit ALU, the load/store unit, the two 16-bit ALUs of the narrow-width cluster, the duplicate 16-bit register file and the 64-bit register file. As a consequence, operand muxes with a fan-in of 6 are required to gate an operand source to its FU. In the conventional processor model, a fan-in of 5 is required instead. Hence, this design slightly increases the complexity of the bypass network at that point. However, we believe the substantial complexity reduction obtained elsewhere (e.g. register file, interconnect) will likely make up for the slight increase in area and access time due to these muxes. Note however that, with the limited-connectivity WPM design, the bypass complexity of both approaches becomes equivalent. In addition, the bypass complexity on the narrow-width cluster is always reduced since the corresponding fan-in is at most 4.

4.3 Wake-up and select logic

On a modern superscalar machine, an operation waits in the issue window for its source operands to become available before being steered to a particular FU. Assuming a dyadic instruction with up to n distinct producers that may produce a value for each one of its source operands, a total of 2 * n comparators must be implemented in the wakeup logic to track all these possible wakeup points. With our basic WPM design, 4 possible wakeup sources must be monitored on the full-width cluster, compared to 3 on the narrow-width cluster. With the conventional processor, the number of distinct wakeup points is 4. These designs are therefore equivalent, with a slight advantage to our approach regarding narrow-width operations.
Note however that with the limited-connectivity WPM design, the number of wakeup sources on the full-width cluster drops from 4 to 3.

4.4 Interconnect fabric

The impact of the interconnect on the area, power and delay is expected to grow as the device features become smaller. This trend intensifies on partitioned microarchitectures as long interconnect wires are required to connect distant clusters. Recent research in this line reveals that up to 50% of the total dynamic power consumption is due to the interconnects [19], while a significant performance degradation can be attributed to interconnection delays as device features get smaller [27, 12]. In this section, we show how WPMs can help to tackle these issues.

Interconnect area. Figure 6 illustrates the physical layout of a wire. The physical design of wires imposes a minimum spacing between them to mitigate the performance degradations due to sidewall capacitance between parallel adjacent wires. The area occupied by wires is proportional to the number of wires, the width of a wire and its length [27]. In the conventional model, the data transfer between clusters involves sending 64 bits of data. Hence, each inter-cluster connection requires 64 wires. Using WPM, the number of interconnect wires is reduced by a factor of four.

Figure 6: Wire layout

Interconnect power. The power dissipated by wires [12] is generally expressed as

P = a \cdot f \cdot N_{wire} \cdot C \cdot V^2

In this equation, a represents the activity factor on the wire, f is the wire switching frequency, Nwire models the number of wires, C is the wire capacitance, and V is the voltage swing. Since WPM reduces the value of Nwire by a factor of four, the activity factor a and the switching frequency f are also likely to decrease, as fewer wires imply a high likelihood of a smaller effective switching frequency a * f. As a consequence, compared to a conventional clustered architecture, WPM provides the potential of significantly reducing the interconnect power consumption.

Opportunity for reduced delays. So far, we have assumed homogeneous interconnect wires, i.e. the physical characteristics of the wire shown in Figure 6 were kept the same throughout this study. However, it is possible to vary these physical characteristics to reduce the interconnect delay or power consumption. The main idea is to take advantage of the area reduction obtained with WPMs (see paragraph above) to design wires with appropriate characteristics that may accelerate the data communication time between narrow-width and full-width clusters.
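As a first-order illustration of the interconnect area and power expressions above, the following sketch compares a 64-wire link with a 16-wire link. It assumes that the per-wire width, length, capacitance, voltage swing, activity factor and frequency are identical in both cases (the text argues that a and f would in fact drop further); all parameter values are normalized placeholders, not measurements, and the function names are illustrative.

```python
# First-order comparison of a 64-wire and a 16-wire inter-cluster link.
# ASSUMPTION: every per-wire quantity is normalized to 1.0 and kept identical
# for both links, so only the number of wires differs.

def wire_area(n_wires: int, width: float = 1.0, length: float = 1.0) -> float:
    """Area proportional to number of wires x wire width x wire length."""
    return n_wires * width * length


def wire_power(n_wires: int, a: float = 1.0, f: float = 1.0,
               c: float = 1.0, v: float = 1.0) -> float:
    """P = a * f * N_wire * C * V^2."""
    return a * f * n_wires * c * v ** 2


area_ratio = wire_area(16) / wire_area(64)
power_ratio = wire_power(16) / wire_power(64)
print(area_ratio, power_ratio)  # prints: 0.25 0.25
```

These ratios only capture the effect of the wire count; they are not the relative power figures used later in the evaluation, which are taken from [2].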
R_{wire} = \frac{\rho}{(\mathit{Height} - \mathit{barrier})(\mathit{Width} - 2 \cdot \mathit{barrier})}    (2)

C_{wire} = \epsilon_0 \left( 2K\epsilon_{horiz}\frac{\mathit{Height}}{\mathit{Spacing}} + 2\epsilon_{vert}\frac{\mathit{Width}}{l_{spacing}} \right) + \mathit{fringe}(\epsilon_{horiz}, \epsilon_{vert})    (3)

To see how this may be achieved, consider a wire with resistance Rwire and capacitance Cwire, shown in Equation (2) and Equation (3), respectively [12]. In the above equations, ρ is the material resistivity, Height and Width represent the height and the width of the wire, barrier models the thin barrier that prevents copper from diffusing into the surrounding oxide, the various ε represent the different dielectric constants for vertical and horizontal capacitors, K accounts for the Miller effect, while lspacing is the spacing between adjacent metal layers. The delay, D, at which data are transferred along wires is proportional to Rwire × Cwire.
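To illustrate how Equations (2) and (3) respond to the wire geometry, the sketch below evaluates the product Rwire × Cwire for two hypothetical wires. All dimensions and constants are unitless placeholders chosen for illustration, and the fringe term of Equation (3) is ignored, which is a simplifying assumption of this sketch rather than something stated in the text.

```python
# Relative RC delay of a wire from Equations (2) and (3).
# ASSUMPTIONS: unitless placeholder geometry and constants; the fringe
# capacitance term of Equation (3) is dropped for simplicity.

def r_wire(rho: float, height: float, width: float, barrier: float) -> float:
    """Equation (2): resistance of the wire."""
    return rho / ((height - barrier) * (width - 2 * barrier))


def c_wire(eps0: float, k: float, eps_horiz: float, eps_vert: float,
           height: float, width: float, spacing: float, l_spacing: float) -> float:
    """Equation (3) without the fringe term."""
    return eps0 * (2 * k * eps_horiz * height / spacing
                   + 2 * eps_vert * width / l_spacing)


def rc_delay(width: float, spacing: float, *, rho: float = 1.0,
             height: float = 2.0, barrier: float = 0.1, eps0: float = 1.0,
             k: float = 1.5, eps_horiz: float = 1.0, eps_vert: float = 1.0,
             l_spacing: float = 1.0) -> float:
    """Quantity proportional to the delay D, i.e. R_wire * C_wire."""
    r = r_wire(rho, height, width, barrier)
    c = c_wire(eps0, k, eps_horiz, eps_vert, height, width, spacing, l_spacing)
    return r * c


thin_wire = rc_delay(width=1.0, spacing=1.0)   # bandwidth-optimized wire
wide_wire = rc_delay(width=2.0, spacing=2.0)   # wider, more widely spaced wire
print(f"thin: {thin_wire:.3f}, wide: {wide_wire:.3f}")
```

With these placeholder values, doubling both the width and the spacing lowers the resistance and the sidewall capacitance enough to more than halve the RC product, at the price of a larger footprint per wire.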
To improve the delay, D, it is sufficient to increase the wire width and spacing in Rwire and Cwire, respectively. This results in a significant reduction of the delay, but at the cost of an increased area overhead. Our design suits such a purpose well since, usually, interconnect wires of less than 20 bits are considered for this type of implementation [2]. In such cases, the resulting area occupancy is somewhat equivalent to that of the conventional architecture with 64-bit wires optimized for bandwidth, i.e. wires with small width and spacing. In addition, since the spacing is increased in Cwire, the capacitance itself gets reduced, resulting in a significant reduction of the wire power consumption. It has recently been shown that this technique can be incorporated into modern processors with only a marginal increase in complexity [2]. The authors reported a 70% reduction of the delay and a 16% reduction of the dynamic power consumption when compared with a wire that is optimized for bandwidth, as in the conventional processor case. Considering the potential reductions of the delay and power consumption we just elaborated, we believe that WPMs provide a strong motivation for the deployment of such heterogeneous interconnects.

5 Instruction Steering Mechanism

Various steering schemes [3, 7, 9, 21, 1] have been considered in the literature for allocating instructions to clusters. Most of these schemes relied upon heuristics that strive to minimize communications and workload imbalance. In our study, we showed that both the amount of communication and the load balancing among clusters are tightly tied to the availability and the distribution of narrow-width operations in programs (see Section 2). Hence, the main challenge with WPM is to reveal all the narrow-width instructions. Several studies [18, 24] have pointed out the strong predictability of data width. These studies show that simple schemes are capable of achieving high data-width coverage, about 95%.
This setiononsiders a simple data-width preditor sheme to unover narrow-width operations. Weshow how this preditor an be integrated into the steering mehanism to speulativelysteer instrutions to the proper luster. Finally, sine a wrong data-width predition leadto an erroneous exeution, we show how a replay mehanism orrets this at runtime.5.1 Data-width preditorThe data-width preditor is used to identify the program instrutions that produe a narrow-width result. The bitwidth of memory operations is also predited. To keep trak of previousdata-width preditions, we maintain an array of 3-bit saturating ounters indexed by theinstrution address. Sine we use the instrution address to index the array, the table lookupan be performed as soon as the instrution address is known and will therefore not lie onthe ritial path. An operation is predited to be narrow-width whenever the ounter issaturated. Otherwise, it is onsidered to be a full-width. In our study, we disriminatebetween two types of data-width mispreditions. A onservative mispredition takes plaewhenever a data-width larger than the eetive data-width is predited. A onservativedoes not aet the exeution and reets the number of optimization opportunities that
PI no1742
16 Roheouste, Pokam & Sezne
Figure 7: 3-bit bimodal data-width mispredition rates. Eah bar orresponds to 4K, 8K,and 16K entries.we might miss. An eetive mispredition ours whenever a data-width is predited witha narrower size than the eetive data-width. In this latter ase, it is neessary to resortto the reovery mehanism desribed in setion 5.3. The saturating ounter is updated asfollows: it is inremented upon enountering a narrow-width operation and reset to zeroupon enountering a full-width operation. The rationale behind doing so is to inrease theauray of narrow-width operations preditions at the ost of missing some optimizationopportunities. Note that adding more hysteresis bits an further redue the number ofeetive mispreditions.Figure 7 illustrates the mispredition rates of this data-width preditor. Table sizes rang-ing from 4K to 16K have been onsidered. The results disriminate between onservativemispreditions and eetive mispreditions. In average, around 2.5% of onservative mispre-ditions and around 0.1% of eetive mispreditions are enountered for a preditor tablefeaturing only 4K entries. Note that a few benhmarks (g, vortex) enounter a signiantonservative mispredition rate, but still exhibit a low eetive mispredition rate (< 0.5%).We observed that these benhmarks might benet from inreasing the preditor table size.5.2 Speulative instrution steeringOur steering mehanism assigns lusters to instrutions being renamed aording to theloation of the soure operands along with the deision of the data-width preditor. Thissimple heuristi onsiderably simplies the steering logi. It is possible to onsider moreompliated heuristis based upon dependeny hain information among operations so thatperformane is maximized. Using suh approah, however, would likely plae the steeringlogi in the ritial path as more logi will be needed. 
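The predictor of Section 5.1 and the steering decision outlined in this section can be sketched in a few lines. This is a minimal software model under stated assumptions, not the hardware design: the modulo indexing of the table, the class and function names, and the choice of destination register file when all sources are narrow but the result is predicted full-width are all hypothetical choices made for illustration.

```python
# Minimal model of the 3-bit saturating data-width predictor and of the
# speculative steering decision. Names and indexing scheme are hypothetical.

class DataWidthPredictor:
    def __init__(self, entries: int = 4096, counter_bits: int = 3):
        self.entries = entries
        self.max_count = (1 << counter_bits) - 1  # 7 for 3-bit counters
        self.table = [0] * entries

    def _index(self, pc: int) -> int:
        # ASSUMPTION: plain modulo indexing; the text only states that the
        # table is indexed by the instruction address.
        return pc % self.entries

    def predict_narrow(self, pc: int) -> bool:
        """Narrow-width is predicted only when the counter is saturated."""
        return self.table[self._index(pc)] == self.max_count

    def update(self, pc: int, was_narrow: bool) -> None:
        """Increment on a narrow-width outcome, reset to zero otherwise."""
        i = self._index(pc)
        if was_narrow:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = 0


def steer(all_sources_in_narrow_rf: bool, predicted_narrow: bool,
          is_memory_op: bool = False) -> tuple:
    """Return (cluster, destination RF) for the instruction being renamed."""
    if all_sources_in_narrow_rf and not is_memory_op:
        if predicted_narrow:
            return ("narrow", "16-bit RF")
        # ASSUMPTION: a result predicted full-width is written to the 64-bit RF.
        return ("full", "64-bit RF")
    # At least one source lives in the 64-bit RF, or the instruction is a
    # memory operation: conservatively use the main (full-width) cluster.
    if predicted_narrow:
        return ("full", "16-bit duplicate RF")
    return ("full", "64-bit RF")
```

Resetting the counter on any full-width outcome makes the predictor slow to declare an instruction narrow (seven consecutive narrow results are needed with 3-bit counters), which trades missed opportunities for fewer effective mispredictions, as the text explains.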
Note that as there exists only one load/store unit, memory operations are scheduled on the main cluster. We assume that the renaming process is aware of the register file affiliation to clusters; this can be done through an explicit numbering of the physical register addresses, e.g. odd/even numbering.

The steering of instructions to clusters proceeds as follows. Let us consider I, the instruction to be steered. Depending upon the physical location of I's source operands, two cases may occur:

• All the source operands of I reside in the narrow-width RF (16-bit). In this case, if the data-width predictor outcome indicates a 16-bit data-width, I has to be dispatched to the narrow-width cluster. Otherwise, I will be assigned to the full-width cluster.
• Any source operand of I resides in the full-width RF (64-bit). In this case, we make the conservative decision to dispatch I to the main (full-width) cluster. If the data-width predictor outcome indicates a 16-bit data-width, I will produce its result on the 16-bit duplicate RF. Otherwise, the result of I will have to be written back onto the 64-bit RF.

5.3 Recovery mechanism

In our study, we assume that a misprediction is detected at the execution stage by a zero detection logic which is available in many current implementations [4]. Whenever an effective misprediction occurs, a replay trap must be triggered by the hardware. In this case, the pipeline in each cluster must be flushed to prevent any potential resource deadlocks. Prediction tables are updated and instructions are then reassigned accordingly. In our context, using the recovery mechanism is similar in cost to a branch misprediction recovery. Furthermore, both structures involve the same operations. Hence, the logic required for the recovery mechanism can be shared with the data-width predictor so that the hardware complexity is made negligible.

6 WPM Evaluation

In the previous sections, we argued that WPM reduces the complexity of a conventional clustered processor. In this section, we present an evaluation of WPM, showing how it compares with the baseline model.

6.1 Simulation methodology

For our experiments, we used a modified version of the MASE microarchitectural simulator, which is based on SimpleScalar [16]. In particular, MASE was modified to model the clustering of integer FUs and the duplication of integer RFs. The modifications take into account the contentions on the cluster interconnect, the issue queues, the physical register files and the register renaming. We also model the bimodal data-width predictor along with the data-width recovery mechanism. Our baseline microarchitecture is derived from the Alpha 21264 [13]. Table 2 summarizes the main machine parameters assumed for the rest of this study. The relative delay estimates as well as the relative power consumption values for the homogeneous and the heterogeneous interconnects are directly derived from [2]. Note that the processing of floating-point operations is done in a separate cluster as in the Alpha 21264. Two WPM configurations are considered for comparison with the baseline processor introduced in Section 3.1. These are the basic WPM described in Section 3.2.1 and the limited-connectivity WPM discussed in Section 3.2.2.

Parameter                    Configuration
Fetch queue                  16
Branch predictor             bimodal
Data-width predictor         4k 3-bit bimodal
Fetch width                  4
Issue width                  2 per cluster
Decode width                 4
Retire width                 4
Issue queue                  16 per cluster
FUs (64-bit)                 ALU + LD/ST
FUs (16-bit)                 2 ALUs
Register file                80
ROB                          64
LSQ                          16
homog. interconnect delay    2 cycles
heter. interconnect delay    1 cycle
64-bit interconnect power    1.0
16-bit interconnect power    0.84

Table 2: Simulated machine parameters.

MediaBench
Benchmark      Input
epic           test_image.pgm [.pgm.E]
g721           clinton.g721 [.pcm]
ghostscript    test.ppm
jpeg           testout.ppm [.jpeg]
mesa           -
mpeg2          mei16v2.m2v out.m2v
pegwit         pegwit.enc [.dec]

SPECInt2000
Benchmark      Input
bzip2          input.graphic
gcc            scilab.i
gzip           input.random
mcf            inp.in
parser         ref.in
vortex         lendian1.raw
vpr            place.in

Table 3: Benchmark applications.

We conducted our evaluation with several benchmarks collected from MediaBench and SPEC2000. All applications were compiled with the PISA gcc compiler using the -O2 and -funroll-loops optimization flags. Table 3 presents the benchmarks along with the input data sets used for collecting the performance numbers. The applications which execute fewer than 300 million instructions were run until completion. For the others, we fast-forwarded past 100 million instructions and simulated over 200 million instructions.

6.2 Workload balance

Workload balance is a critical factor for performance in a clustered microarchitecture. If the load on a cluster is unbalanced with respect to another, performance may be significantly impaired since one cluster might be overloaded while the other might be idle for most of the execution. We estimated the workload balance in WPM as the difference in the number of ready instructions in each cluster at each cycle [7], i.e.
a zero difference identifies a perfect balance scenario, a difference of one means one cluster has one more instruction than the other, etc. The results presented in Figure 8 consider the workload balance distribution of our basic WPM implementation featuring a 4k bimodal data-width predictor. It is shown that, on average, the clusters' workload is very well balanced for about 50% of the program execution ([0-2] differences) whereas it is reasonably balanced for 80% of the execution ([0-5] differences). Considering the asymmetric nature of our WPM implementation, these results appear very promising. In addition, we only consider a simplified steering heuristic which does not rely upon any workload information. It may therefore be feasible to improve the WPM steering mechanism, but at the cost of missing some narrow-width optimizations.

Figure 8: Workload balance distribution for MediaBench (left) and SPEC2k (right) benchmarks.

6.3 Performance impact

We considered the balanced RBMS steering heuristic [7] to assign clusters to instructions in the baseline microarchitecture. The balanced RBMS heuristic tries to minimize the number of communications by steering dependent operations to the same cluster while also taking into consideration the load of the clusters. For a conventional clustered microarchitecture, the performance degradation (measured in terms of overall IPC) primarily depends upon the workload distribution and the number of inter-cluster communications. For a WPM, data-width mispredictions may further affect performance as additional cycles are needed for recovering to a correct state. To exhibit the effectiveness of our proposal without accounting for the impact of mispredictions, we implemented a basic WPM featuring an oracle data-width predictor. Figure 9 shows the performance degradation relative to the baseline model for different WPM configurations.

Figure 9: IPC variation.

As shown in Figure 9, the average performance of the basic WPM with an oracle bitwidth predictor is very close to that of the conventional clustered model that uses a complex steering mechanism. We can, however, notice that some applications (epic, mpeg2_decode, unepic) perform better on the basic WPM featuring an oracle width predictor than on the baseline model. We observed that this is due to the fact that the workload is very unbalanced on these applications when considering the baseline steering heuristic. Considering a realistic data-width predictor of course adds some overhead. Figure 9 shows indeed that the performance is degraded by about 6% on average for the basic WPM with a bimodal data-width predictor. First, this degradation accounts for the data-width mispredictions and the cost of the replay. Second, this degradation also depends upon the distribution of the narrow-width operations, which is determined by the data-width predictor. By referring to Figure 7, we can observe that some benchmarks (gcc, mcf, vortex) exhibit a high number of conservative mispredictions. However, this does not affect the performance since only the effective mispredictions drive the replay, which is the real performance bottleneck. In addition, the relatively good workload balance of these applications also contributes to keeping this overhead low. The additional performance degradation observed with the limited WPM scheme is principally due to the copy instructions that need to be scheduled each time an operation must consume an operand value that is only available remotely. These copy operations are meant to synchronize a local RF with its remote copy. The latency of the copy operation is therefore equal to the delay of the interconnect. As a result, the dependent operation must stall until the operand value is available locally. Figure 9 shows that this may have a detrimental impact on performance, with an average slowdown of almost 13% observed on the benchmarks. However, as noted earlier (see Section 3.2.2), by considering a data-width usage predictor, copy instructions may be issued speculatively before use, just after an operation is issued that may produce a narrow-width value consumed remotely. This approach would be very effective in mitigating the performance degradation observed with the limited WPM scheme.
Finally, we also considered an implementation with a heterogeneous interconnect which is able to propagate the data twice as fast as in the baseline clustered model. With this scheme, Figure 9 shows that an average performance improvement of 6% is observed for the basic WPM and a performance degradation of 1% is obtained for the limited WPM as compared to the baseline clustered model. Note that these results consider a very conservative heterogeneous interconnect delay. For instance, as a comparison, [2] considers that the wire delay of the heterogeneous interconnect can be reduced by a factor of 3.

6.4 Power reduction

With respect to a conventional superscalar processor, a clustered architecture can achieve significant energy savings due to the decrease in complexity of various critical processor components. Our WPM model benefits from these energy savings while further reducing power dissipation in most of the datapath components. In the first place, the power consumption of the functional units designed to treat the narrow-width operations decreases by a linear factor as noted in [24]. As discussed in Section 4.1, the energy consumption of the register files is dramatically lowered due to a decrease in the number of access ports and the width of registers. As a result of exclusively communicating narrow-width values, the energy dissipated in the interconnect fabric is also reduced in a significant way. Energy savings in the latter structure result from both the reduction in the number of wires used and the infrequent occurrence of inter-cluster communications. Note that WPM may still involve additional power overhead due to resorting to a speculative scheme and the use of extra multiplexers in the bypass network. To minimize the power impact of the data-width predictor, we considered a relatively small history table which dissipates a tolerable amount of power [23]. Overall, we believe that the resulting power overhead is negligible compared to the energy savings obtained by the other WPM components.

Figure 10: Power savings breakdown (ALU, RF, interconnect (IC)).

Figure 10 reports the energy savings realized by the basic and the limited-connectivity WPM implementations as compared to the baseline model. The first bar represents the energy gain realized by the functional units. On average, it can be seen that using the basic WPM implementation yields a 20% energy reduction, whereas with the limited-connectivity WPM up to 13% of the ALU energy can be saved.
This difference stems from the fact that the limited-connectivity model makes more use of the functional units for processing the copy instructions. One may therefore optimize the energy consumption of the limited-connectivity model by refining the steering heuristic so that it minimizes the inter-cluster communications. Next, we can observe that up to 50% of the RF energy consumption is saved with both the basic and the limited WPM implementations. The difference in the energy savings is not significant as the treatment of the copy instructions in the limited WPM scheme consumes extra energy when the operand value that needs to be communicated is accessed. Power savings realized in the interconnect fabric are the most significant in our approach. Indeed, up to 60% of the energy is saved with the basic WPM implementation, while an 80% energy reduction is achieved with the limited-connectivity model. Note that we did not consider lower-power wires that can further reduce the energy dissipation by a factor of 3 [2]. Considering that modern microprocessors dissipate a large amount of power in the interconnect fabric (50% in the Pentium 4 [19]), our WPM proposal might therefore have a significant impact on the overall microprocessor power consumption. It should be noted that employing a WPM also helps reduce the leakage power in the datapath components. For instance, Balasubramonian et al. [2] note that the static power in the interconnect fabric can be reduced by a factor ranging from 1.26 to 3 when using a small number of wires. Using narrow-width structures also helps tackle the static energy of the RF and FUs.

7 Related Work

Several techniques have been proposed to tackle the complexity growth associated with scaling up the performance of modern superscalar processors. One approach, partitioning, consists of arranging the resources of a processor into clusters. The other approach considered in this paper consists of tackling the complexity growth problem by exploiting narrow-width data. Each of these approaches is examined in the next sections.

7.1 Partitioned microarchitectures

In a partitioned microarchitecture, the critical processor components are arranged into smaller computational units, called clusters. A cluster represents the complexity-effective counterpart of a centralized design; it can therefore be amenable to sustaining higher clock rates. In addition, a clustered microarchitecture can scale to a larger issue width since the parallelism can be distributed across the clusters. Several research papers discussed variants of this type of architecture [21, 9, 1]. Unlike a centralized design, data produced on one cluster may be communicated to another cluster using an interconnect fabric. Since the latency of the interconnect fabric is higher than the intra-cluster latency, the instruction steering logic should try to minimize the inter-cluster communications, while at the same time balancing the workload among the clusters. Multiple heuristics for the instruction steering logic have been proposed in the literature [3, 7, 9, 21, 1]. This paper contrasts with these previous works by considering a new heuristic for the instruction steering logic based on narrow-width data, which also proves to balance the workload among clusters.
Attempts to exploit partitioning to reduce the complexity of the register file include the Alpha 21264 [13]. The Alpha 21264 provides each cluster with a copy of the register file (RF). This approach reduces the number of read ports, but requires each RF copy to have the same number of write ports as there are functional units. Seznec et al. [26] improved this approach by reducing the number of write ports through write specialization. Our approach considers a different clustering motivation to tackle processor complexity, i.e. the narrow-width operations property of programs, but can still take advantage of this technique to further reduce the complexity.

In an independent study, Gonzalez et al. [11] recently examined a partitioned design which shares some similarities with our proposal, as it is also based on the narrow-width property of programs. However, while we exploit narrow width to reduce the complexity of some critical datapath components and to reduce power consumption, Gonzalez et al. [11] focused on a performance-oriented design. An asymmetric processor organization that distributes the execution core among two clusters is proposed. It features a regular 64-bit cluster and a narrow 20-bit cluster with limited resources but running at twice the clock frequency. For this design, it is therefore desirable to steer most of the program instructions towards the narrow cluster to enable performance gains. A conventional data-width predictor is used for that purpose. A first difference with our approach is that address computations with invariant high-order bits are also considered, thereby enabling about 75% of the instructions to be executed on the narrow cluster but at the cost of a significant workload imbalance. Although this partitioned design is shown to slightly improve performance, it is much more difficult to assess its impact on the processor complexity. Doubling the narrow core frequency also involves doubling the frequency of other component logic (e.g. wake-up and issue logic) and leads to a corresponding increase of the power consumption. Many complex artefacts are also required, for instance the replay logic in case of width misprediction and the TLB mechanisms. Neither complexity nor power analysis was discussed in the paper. Unlike their study, we provide a detailed analysis of the processor complexity factors which demonstrates the feasibility of the WPM model. Our work also introduces many unique features (RF duplicate) which have been shown to further reduce the processor complexity with only little impact on performance.

7.2 Exploiting narrow-width operands

The observation that most applications can be executed by a narrow datapath is due to Brooks et al. [4]. They coined the term narrow-width operands for data that can be represented on a datapath of less than 16 bits. This section is concerned with optimization schemes that directly make use of narrow-width operands.

7.2.1 Narrow-width optimizations

A microarchitecture which is not aware of the narrow-width data property would exercise the full datapath width upon each instruction execution, irrespective of the size of the data. Hence, one approach to make efficient use of the narrow-width data is to optimize a
24 Roheouste, Pokam & Sezneproessor for power-eieny, reduing the eetive number of transitions that takes plaeon the datapath. Implementations of this approah inlude [4, 5, 6, 24℄. Other approaheshave onsidered instead using the empty bitwidth slies on the datapath to inrease theeetive issue width by allowing several narrow-width data to share the datapath [25, 18, 20℄.7.2.2 Register le optimizationsThere are two major ontributors to the register le omplexity: the aess time, whihis strongly orrelated with the number of physial registers, and the area/power whih islargely determined by the number of available read/write ports. This setion is onernedwith some of the reent proposals that exploit narrow-width data to takle these issues.Lipasti et al. [17℄ addressed the ase of reduing the pressure on the register le by makingeetive use of the available physial registers. The idea is based on the observation that thetime between the last read and the release dominates a physial register lifetime, whereasmostly only few bits are required to represent the data values stored in these registers.Hene, the authors proposed to early freeing up registers ontaining narrow-width data bystoring their ontent in the ID eld of the register map table. The range of narrow-widthvalues that an be overed by this sheme is therefore strongly dependent on the bit-widthof the ID eld. For a typial register le of size 64, only 8-bits index would be availablein the register map table. To address a larger range of narrow-width values, the size ofthe map table would have to be saled aordingly. It is obvious that this is not withoutonsequenes on the miroarhiteture.Ergin et al. 
[8℄ proposed to exploit narrow-width data by means of register paking.Similar to the SIMD programming model [15℄, the authors propose to pak several narrow-width data into a single register, making eetive use of available registers; thus reduingthe pressure on the register le. For a onventional 64-bit register le, for instane, eahregister is divided into four partitions of size 16-bit eah. A narrow-width value is allowed tobe represented with any partition ombination. This sheme has the potential to ompliatethe miroarhiteture (e.g. the register read stage) as a narrow-width value may now oupyany partition ombination inside a register.Pokam et al. [24℄ proposed the byte-slie register le to redue the energy onsumptionof a register le. A 32-bit onventional register le is partitioned into three slies of size 8-,8-, and 16-bit eah. A data an be plaed into the lowest 8-bit slie or into the rst two 8-bitslies to represent a narrow-width value of size 8-bit or 16-bit, respetively. When operatingin one of these narrow-width modes, the unused upper slies of the register le are put intoa drowsy-mode [10℄ to save stati energy. This sheme requires signiant modiations tothe memory ells.The proposal by Kondo et al. [14℄ is somewhat similar to [24℄ and [8℄. They presented adetailed implementation of a bit-partition register le that takes advantage of narrow-widthdata by dividing a onventional register le into bit-partitions of equal size. Eah suh bit-partition an hold a dierent narrow-width data. Hene, this approah is somewhat similarin omplexity to [8℄.

We believe that the impact of these various proposals on the microarchitecture is not always easy to assess. The register file is at the heart of processor performance, such that any change it undergoes is likely to have a detrimental effect on the cycle time. On the other hand, clustered architectures have already demonstrated their potential to reduce the complexity growth. Our approach thus naturally combines these two proposals and proves, in effect, to be very effective at eliminating most of the overhead found in previous work.

8 Conclusions

Using 64-bit ISAs in general-purpose computing (PCs, servers, ...) has become mainstream. Therefore, datapaths on current processors are 64 bits wide. However, the analysis of workloads shows that applications also contain a very significant proportion of narrow-width operands. Moreover, the use of these narrow-width operands is often evenly distributed over the overall execution. To exploit these properties, we introduced a new design, called width-partitioned microarchitecture (WPM), to help master the hardware complexity of superscalar processors. By featuring a full-width cluster and a narrow-width cluster, WPM exploits the natural distribution of the narrow-width and the larger data-width operations found in programs to balance the workload among the clusters.

We showed that such a partitioning approach greatly reduces the complexity of existing microarchitectures. Indeed, we showed that WPM significantly reduces the area and the power overhead of the register file and the interconnect fabric, thus giving rise to more aggressive implementations. In addition, we demonstrated that WPM makes it possible to break down the complexity of several critical processor structures, including the register file, the wakeup and select logic, and the bypass network. Overall, the performance of WPMs is very promising.
Our evaluation showed that, using a WPM architecture instead of a classical 64-bit 2-cluster architecture, more than 50% of the power consumption can be saved on the register file and the interconnect fabric with a performance overhead of less than 6%. Moreover, exploiting narrow-width data may allow more aggressive cluster interconnection implementations and may even result in a performance improvement, as illustrated in Section 6.3.

Further research is needed to investigate ways to improve WPMs. For instance, the sensitivity to the bitwidth predictor has not been studied in depth, although it is very likely that more aggressive bitwidth predictors will improve the capability of WPM. In addition, it would be equally interesting to study the scalability of WPM. In particular, an interesting question is how the narrow-width clusters should scale with increasing issue width. Furthermore, for the sake of simplicity, this study has only considered the integer functional units. It is very likely that other structures may also benefit from WPM. One structure worth looking at is the data cache. Indeed, since the load-store unit can be distributed among clusters, it will be interesting to investigate the issues of partitioning the data cache along with the narrow-width cluster.
26 Roheouste, Pokam & SezneReferenes[1℄ Balasubramonian, R., Dwarkadas, S., Albonesi, D. Dynamially Managing the Communiation-parallelism Trade-o in Future Clustered Proessors. In Proeedings of the 30th International Sympo-sium on Computer Arhiteture, June 2003.[2℄ Balasubramonian, R., Muralimanohar, N., Ramani, K., and Venkatahalapathy, V. MiroarhiteturalWire Management for Performane and Power in Partitioned Arhitetures . In Proeedings of the 5thInternational Symposium on High-Performane Computer Arhiteture, February 2005.[3℄ Baniasadi, A., and Moshovos, A. Instrution Distribution Heuristis for Quad-Cluster,Dynamially-Sheduled, Supersalar Proessors. In Proeedings of the 33th International Symposium on Miroar-hiteture, De. 2000.[4℄ Brooks, D., and Martonosi, M. Dynamially Exploiting Narrow Width Operands to Improve ProessorPower and Performane. In Proeedings of the 5th International Symposium on High-PerformaneComputer Arhiteture, January 1999.[5℄ Canal, R., Gonzales, A., and Smith, J. E. Very Low Power Pipelines Using Signiane Compression.In Proeedings of the 33th International Symposium on Miroarhiteture, Deember 2000.[6℄ Canal, R., Gonzales, A., and Smith, J.E. Software-Controlled Operand-Gating. In Proeedings of theInternational Symposium on Code Generation and Optimization, Marh 2004.[7℄ Canal, R., Parerisa, J-M., Gonzalez, A. Dynami Cluster Assignment Mehanisms. In Proeedings ofthe 6th International Symposium on High-Performane Computer Arhiteture, Jan. 2000.[8℄ Ergin, O., Balkan, D., Ghose, K., and Ponomarev, D. Register Paking: Exploiting Narrow-WidthOperands for Reduing Register File Pressure. In Proeedings of the 37th International Symposium onMiroarhiteture, Deember 2004.[9℄ Farkas, K. I., Chow, P., Jouppi, N. P., and Vranesi, Z. The Multiluster Arhiteture: Reduing CyleTime Through Partitioning. 
In Proeedings of the 30th International Symposium on Miroarhiteture,Deember 1997.[10℄ Flautner, K., Sung Kim, N., Martin, S., Blaauw, D., and Mudge, T. Drowsy Cahes: Simple Teh-niques for Reduing Leakage Power. In Proeedings of the 29th International Symposium on ComputerArhiteture, May 2002.[11℄ Gonzalez, R., Cristal, A., Veidenbaum, A., Perias, M. and Valero, M. An Asymmetri ClusteredProessor based on Value Content. In Proeedings of the 19th ACM International Conferene onSuperomputing, June 2005.[12℄ Ho, R., Mai, K. W., and Horowitz, M. A. The Future of Wires. Proeedings of the IEEE, 89(4):490504,Apr. 2001.[13℄ R. Kessler. The Alpha 21264 Miroproessor. IEEE Miro, 19(2):2436, Marh 1999.[14℄ Kondo, M., and Nakamura, H. A Small, Fast and Low-Power Register File by Bit-Partitioning. In Pro-eedings of the 11th International Symposium on High-Performane Computer Arhiteture, February2005.[15℄ Larsen, S., and Amarasinghe, S. Exploiting Superword Level Parallelism with Multimedia Instru-tion Sets. In Proeedings of the ACM SIGPLAN Conferene on Programming Language Design andImplementation, June 2000.[16℄ Larson, E., Chatterjee, S., and Austin, T. Mase: A novel infrastruture for detailed miroarhiteturalmodeling. In Proeedings of the 2001 International Symposium on Performane Analysis of Systemsand Software, November 2001.[17℄ Lipasti, M. H., Mestan, B. R., and Gunadi, E. Physial Register Inlining. In Proeedings of the 31thInternational Symposium on Computer Arhiteture, June 2004.[18℄ Loh, G. Exploiting Data-Width Loality to Inrease Supersalar Exeution Bandwidth. In Proeedingsof the 35th International Symposium on Miroarhiteture, November 2002. Irisa
[19] Magen, N., Kolodny, A., Weiser, U., and Shamir, N. Interconnect-power Dissipation in a Microprocessor. In Proceedings of the 2004 International Workshop on System Level Interconnect Prediction, 2004.
[20] Nakra, T., Childers, B. R., and Soffa, M. L. Width-Sensitive Scheduling for Resource-Constrained VLIW Processors. In Proceedings of the 3rd ACM Workshop on Feedback-Directed and Dynamic Optimization, December 2000.
[21] Palacharla, S., Jouppi, N. P., and Smith, J. E. Complexity-effective Superscalar Processors. ACM SIGARCH Computer Architecture News, 25(2):206-218, May 1997.
[22] Parcerisa, J-M., Sahuquillo, J., Gonzalez, A., and Duato, J. Efficient Interconnects for Clustered Microarchitectures. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, September 2002.
[23] Parikh, D., Skadron, K., Zhang, Y., Barcella, M., and Stan, M. R. Power Issues Related to Branch Prediction. In Proceedings of the 8th International Symposium on High-Performance Computer Architecture, 2002.
[24] Pokam, G., Rochecouste, O., Seznec, A., and Bodin, F. Speculative Software Management of Datapath-width for Energy Optimization. In Proceedings of the 2004 ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, June 2004.
[25] Sato, T., and Arita, I. Table Size Reduction for Data Value Predictors by Exploiting Narrow Width Values. In Proceedings of the 14th International Conference on Supercomputing, May 2000.
[26] Seznec, A., Toullec, E., and Rochecouste, O. Register Write Specialization Register Read Specialization: A Path to Complexity-effective Wide-Issue Superscalar Processors. In Proceedings of the 35th International Symposium on Microarchitecture, December 2002.
[27] Theis, T. N. The Future of Interconnection Technology. IBM Journal of Research and Development, 44(3), 2000.
[28] Wilton, J. E., and Jouppi, N. P. CACTI: An Enhanced Cache Access and Cycle Time Model.
IEEE Journal of Solid-State Circuits, 31(5):677-688, May 1996.
[29] Zyuban, V., and Kogge, P. The Energy Complexity of Register Files. In Proceedings of the International Symposium on Low Power Electronics and Design, August 1998.
