High-Performance Matrix Multiplication: Hierarchical Data Structures, Optimized Kernel Routines, and Qualitative Performance Modeling by Wu, Wenhao
Mississippi State University 
Scholars Junction 
Theses and Dissertations Theses and Dissertations 
8-2-2003 
High-Performance Matrix Multiplication: Hierarchical Data 
Structures, Optimized Kernel Routines, and Qualitative 
Performance Modeling 
Wenhao Wu 
Follow this and additional works at: https://scholarsjunction.msstate.edu/td 
Recommended Citation 
Wu, Wenhao, "High-Performance Matrix Multiplication: Hierarchical Data Structures, Optimized Kernel 
Routines, and Qualitative Performance Modeling" (2003). Theses and Dissertations. 2495. 
https://scholarsjunction.msstate.edu/td/2495 
This Graduate Thesis - Open Access is brought to you for free and open access by the Theses and Dissertations at 
Scholars Junction. It has been accepted for inclusion in Theses and Dissertations by an authorized administrator of 
Scholars Junction. For more information, please contact scholcomm@msstate.libanswers.com. 
HIGH-PERFORMANCEMATRIX MULTIPLICATION: HIERARCHICAL DATA







in PartialFulfillment of theRequirements








HIGH-PERFORMANCEMATRIX MULTIPLICATION: HIERARCHICAL DATA
























Major Professor:Dr. Anthony Skjellum
Title of Study:HIGH-PERFORMANCE MATRIX MULTIPLICATION: HIERAR-
CHICAL DATA STRUCTURES,OPTIMIZED KERNEL ROUTINES,
AND QUALITATIVE PERFORMANCEMODELING
Pagesin Study:69
Candidatefor Degreeof Masterof Science
The optimal implementationof matrix multiplication on moderncomputerarchitec-
turesis of greatimportancefor scientificandengineeringapplications.However, achiev-
ing the optimal performancefor matrix multiplication hasbeencontinuouslychallenged
both by the ever-wideningperformancegapbetweenthe processorandmemoryhierar-
chy andthe introductionof new architecturalfeaturesin modernarchitectures.Thecon-
ventionalway of dealingwith thesechallengesbenefitssignificantly from the blocking
algorithm,which improvesthe datalocality in the cachememory, and from the highly
tunedinnerkernelroutines,which in turn exploit thearchitecturalaspectson thespecific
processorto deliver nearpeakperformance.A state-of-artimprovementof theblocking
algorithmis theself-tuningapproachthatutilizes“heroic” combinatorialoptimizationof
parameterspaces.Otherrecentresearchapproachesincludetheapproachthatexplicitly
blocksfor theTLB (TranslationLookasideBuffer) andthehierarchicalformulationthat
employsmemory-friendlyMorton Ordering(a space-fillingcurvemethodology).
This thesiscomparesandcontraststheTLB-blocking-basedandMorton-Order-based
methodsfor densematrix multiplication, and offers a qualitative model to explain the
performancebehavior. Comparisonsto the performanceof self-tuninglibrary and the
“vendor” library are also offered for the Alpha architecture. The practicalbenchmark
experimentsdemonstratethat neitherconventionalblocking-basedimplementationsnor
theself-tuninglibrariesareoptimalto achieveconsistenthighperformancein densematrix
multiplicationof relatively largesquarematrix size. Instead,architecturalconstraintsand
issuesevidently restrict the critical pathandoptionsavailable for optimal performance,
so that the relatively simplestrategy andframework presentedin this studyoffershigher
and flatter overall performance.Interestingly, maximal inner kernel efficiency is not a
guaranteeof global minimal multiplication time. Also, efficient andflat performanceis






Theauthorwouldliketo expresshissincerethanksto hismajorprofessor, Dr. Anthony
Skjellum,for his invaluablesupportandguidancefor this researchwork. Thanksarealso
due to his committeemembersDr. DonnaReeseand Dr. Edward Luke. The author
would alsolike to thankfor his collaborators,includingMr. KazushigeGoto,Mr. Vinod
Valsalam,andDr. RobertA. Van de Geijn at Universityof Texasat Austin. Theauthor
alsoowesspecialthanksto the colleaguesin High PerformanceComputingLaboratory
lab and MPI Software Technology, Inc for making both placesenjoyable and exciting
researchenvironments,andMPI SoftwareTechnology, Inc andHewlett-Packardfor kindly
providing the testingfacilities. Finally, theauthorwould like to thankMs. Neli Fairfield




DEDICATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
ACKNOWLEDGMENTS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
LIST OFTABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
LIST OFFIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
CHAPTER
I. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Hypothesis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Organization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
II. LITERATUREREVIEW . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 Matrix Multiplication andLinearAlgebraLibraries . . . . . . . . . . . 7
2.1.1 BasicConceptsandAlgorithms . . . . . . . . . . . . . . . . . . 7
2.1.2 Matrix Multiplication in theLinearAlgebraLibraries . . . . . . 9
2.2 ArchitectureStudy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 ProcessorMicroarchitectureandThePeakPerformance. . . . . 10
2.2.2 MemoryHierarchyandDeliveredPerformance . . . . . . . . . 11
2.3 PreviousMatrix Multiplication Work . . . . . . . . . . . . . . . . . . . 13
2.3.1 Early Stageof theBlocking Algorithm . . . . . . . . . . . . . . 14
2.3.2 Self-TuningLibraries . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.3 Cache-ObliviousAlgorithm . . . . . . . . . . . . . . . . . . . . 17
2.3.4 HierarchicalDataStructures . . . . . . . . . . . . . . . . . . . 18
2.3.5 Variantsof New BlockingApproach . . . . . . . . . . . . . . . 19
2.4 Lessonslearned . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
iv
CHAPTER Page
III. RESEARCHAPPROACH . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1 TLB-ObviousArchitecturalModeling . . . . . . . . . . . . . . . . . . 21
3.1.1 Review of MemoryAccessProcess. . . . . . . . . . . . . . . . 21
3.1.2 Modelingof BlockingAlgorithm . . . . . . . . . . . . . . . . . 23
3.1.3 Blocking for TLB . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 TheMatrix Multiplication Framework . . . . . . . . . . . . . . . . . . 27
3.2.1 HierarchicalDataStructure . . . . . . . . . . . . . . . . . . . . 28
3.2.2 IterativeAlgorithm . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.3 OptimizedKernelStrategy for Alpha . . . . . . . . . . . . . . . 32
IV. PERFORMANCEINTEGRATION, TUNING, AND BENCHMARKING . . 37
4.1 IntegrationandTuningStrategy . . . . . . . . . . . . . . . . . . . . . . 37
4.2 UserLevel PerformanceMetrics . . . . . . . . . . . . . . . . . . . . . 39
4.2.1 In-CachePerformance. . . . . . . . . . . . . . . . . . . . . . . 40
4.2.2 Out-of-CachePerformanceof Matrix Multiplication . . . . . . . 41
V. EXPERIMENTS,RESULTS,AND ANALYSIS . . . . . . . . . . . . . . . . 44
5.1 NotationsandExperimentalConfiguration . . . . . . . . . . . . . . . . 44
5.2 Performanceof InnerKernels. . . . . . . . . . . . . . . . . . . . . . . 47
5.2.1 PeakKernelvsAggregateKernel . . . . . . . . . . . . . . . . . 48
5.2.2 HPCLvs ATLAS . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.3 Performanceof matrixmultiplication . . . . . . . . . . . . . . . . . . . 51
5.3.1 Explicit TLB blockingvs Implicit MemoryOptimization . . . . 51
5.3.2 DeliveredPerformance . . . . . . . . . . . . . . . . . . . . . . 54
VI. CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.2 FutureWork . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
APPENDIX




2.1 TheCharacteristicsof x86ProcessorMicroarchitecture . . . . . . . . . . . . 12
2.2 MemoryHierarchyParametersof Currentx86Architecture . . . . . . . . . . 14
4.1 UserLevel PerformanceMetrics . . . . . . . . . . . . . . . . . . . . . . . . 40
5.1 Notationsof Dif ferentMethods. . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2 ExperimentalConfigurations(Adaptedfrom [20]) . . . . . . . . . . . . . . . 46




2.1 An Exampleof Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . 7
2.2 Pseudocodefor Standard
 
Algorithm . . . . . . . . . . . . . . . . . . . . 8
2.3 ThePyramidof MemoryHierarchy. . . . . . . . . . . . . . . . . . . . . . . 13
3.1 TheDiagramof MemoryAccessProcess. . . . . . . . . . . . . . . . . . . . 22
3.2 Pseudocodefor the
 
BlockingAlgorithm . . . . . . . . . . . . . . . . . . 25
3.3 A HierarchicallyStored
	 Matrix (Adoptedfrom [43]) . . . . . . . . . 29
3.4 Pseudocodefor the
 
iterativealgorithm(Adaptedfrom [43]) . . . . . . . . 31
3.5 SampleAssemblyCodeI of SoftwarePipelining(Adoptedfrom [8]) . . . . . 35
3.6 SampleAssemblyCodeII of SoftwarePipelining(Adoptedfrom [8]) . . . . 35
4.1 PerformanceFlatnessRegardingto TheBlockingSize. . . . . . . . . . . . . 39
5.1 Performanceof PeakKernelandAggregateKernel . . . . . . . . . . . . . . 49
5.2 TheIn-CachePerformanceof HPCLandATLAS . . . . . . . . . . . . . . . 50
5.3 TheOut-of-CachePerformanceof GOTO andHPCL . . . . . . . . . . . . . 52
5.4 Performanceof VariousApproachesonEV67-Tru64 . . . . . . . . . . . . . 56
5.5 Powerof 2 Performanceof VariousApproacheson EV67-Tru64 . . . . . . . 57
5.6 Performanceof VariousApproachesonEV6-Linux . . . . . . . . . . . . . . 58





Theoptimalimplementationof matrixmultiplicationonmodernprocessorsis of great
importancefor many scientific andengineeringapplications[26, 27, 45]. The primary
reasonfor suchimportanceis that, given a high-performanceimplementationof matrix
multiplication, most denselinear algebraoperationscan be efficiently implementedon
variousplatforms[22, 36]. In turn, linearalgebrais centralto many aspectsof scientific
computing. Anothermajor reasonfor suchimportanceis that matrix multiplication in-
herentlyhasgooddatalocality andalgorithmfeaturesto beexploited to obtainthepeak
performanceon theseadvancedprocessors[26, 27].
Achieving theoptimalperformancefor matrix multiplication,however, hasbeencon-
tinuouslychallengedboth by the ever-wideningperformancegapbetweenthe processor
andmemoryhierarchyandthe introductionof new architecturalfeatures[32] in modern
architectures.Theconventionalwayof dealingwith thesechallengesbenefitssignificantly
from theblockingalgorithm[25, 38,41, 48] andhighly optimizedinnerkernels [30, 46].
By carefullydividing thematrix into thesubmatricesthatcanfit into the top level of the
memoryhierarchy, the blocking algorithmhelpswith bridging the performancegapbe-
1
2
tweenthe processorand the memoryhierarchy. On suchsubmatrices,the inner kernel
routinesperformmatrix-multiplication-typeoperationsthat exploit the specificarchitec-
tural featuresof theprocessor. Suchoptimizationsat thekernellevel areoftencodedatthe
assemblylevel becauseof thedemandfor optimalperformanceandtheinability of current
compilertechnologyto deliverpeakperformanceon thetriply nestedloops[29, 46].
In thisthesis,severalrecentapproachesin thisfield arestudiedandunifiedfor pursuing
thehigh-performanceimplementationof matrixmultiplication.
1.2 Hypothesis
The hypothesisof this thesisis that strategiesthat arerefinedandsynthesizedfrom
conventionalmethodsandrecentapproachesoffer betteroverallmatrixmultiplicationper-
formancethancurrentstate-of-artimplementations.In this context, strategiesrefer to the
waysthatactivities canbedoneandbeexploitedin parallelwithin theprocessorandthe
memoryhierarchy. In particular, by carefullyoverlappingandsequencingdatamovement
andcomputationalactivities, this studyachievesflatter andhigherperformancethanthe
vendorlibrariesandthelibrariesthataredrivenby theself-tuningmethodology.
The following backgroundintroductionandquantitative descriptionsof someimpor-
tanttermsfurtheraccomplishthis hypothesis:
Conventional methods:Thesereferto thewell-known blockingtypeapproaches[25, 38,
41,48].
3
Recentapproaches:Theserefer to therecentvariantsof blocking typeapproaches[28,
30] andtherecentresearchon hierarchicaldatastructures[43, 47].
Flatter performance: Thedegreeof flatnessof approachA’s performance, , is de-
fined in Section4.2.2over a certainrangeof matrix size. The performanceof approach
A is flatterthanperformanceof approachB if andonly if  on this rangeof
matrix size.
Higher performance: Over a certainrangeof matrix size,theperformance (mea-
suredin MFLOPS1) of approachA is higherthanB if andonly if  "!#%$&('*)+ .
A detailedderivationof this formulais alsogivenin Section4.2.2.
1.3 Moti vation
The motivationof this thesisarisesfrom the importanceof high-performanceimple-
mentationof matrix multiplication, the challengesposedby the complexity of modern
architectures(especiallycachearchitectures),anda few observationsin previousexperi-
mentswith differentmatrixmultiplicationimplementations:
, Thesamenumberof cachemissesin two competitivealgorithmsdoesnotnecessar-
ily degradetheoverallmatrixmultiplicationperformanceequally;
, The maximal matrix multiplication kernel efficiency is not a guaranteeof global
minimalmultiplicationtime; and,
, Currentarchitecturemodelsdo not incorporatecertainimportantarchitecturalfea-





of kernelcomputationsandmemoryaccessesin theglobalpicture,which play important
rolesfor thedeliveredmatrixmultiplicationperformance.
1.4 Contributions






For the relatively large matrix size, the approachesthat useMorton Orderingasan in-
termediateformat for native row majoredor columnmajoredmatricesis not worthwhile
comparedto theTLB-blocking approachthatexploits theintermediatestoragebuffer and
doesincrementalcopying thatareexplicitly designedfor TLB optimization.
PerformanceCritical Path
TLB misses,inter-cachepeakbandwidth,andregisterfile sizeevidently reducethenum-
berof choicesavailableto achievemaximumperformance,sothatrelatively simplestrate-




Theorganizationof this thesisis asfollows: ChapterII summarizes omebackground
and relatedwork in this field. The researchmethodologyof this study is presentedin
ChapterIII, followedby the introductionof performancemetricsof this thesisandtheir
measurementsin ChapterIV. ChapterV discussestheexperimentalsetupin detail,presents
themeasuredresultsof theinterestedperformancemetrics,andgivesaperformanceanal-
ysisto validatethehypothesis.ChapterVI concludesthethesisandsuggestsfuturework.
A glossarythatlists termsassociatedwith this thesiswork is givenin theappendix.
CHAPTERII
LITERATUREREVIEW
Matrix multiplication is a fairly simpleoperationfrom a mathematicalpoint of view
and is also a fairly computationallyrich operation,which makes it arguably the most
importantoperationof mostlinearalgebralibrariesonhighperformancearchitectures[14,
22, 26, 27, 45]. Section2.1 offers a brief introductionconcerningmatrix multiplication
andits uniqueimportancein linearalgebralibraries.
Nevertheless,the optimal implementationof matrix multiplication on computerar-
chitectures,especiallycachearchitectures,is not easilyachievedbecauseof the inherent
complexity of thesearchitectures.It is recognizedthatto beableto achieve thepeakper-
formancefrom software,the algorithmdesignermusthave a goodunderstandingof the
underlyingarchitecture,that is, what thedifferentcomponentsof a computersystemare,
how thesecomponentswork andinteractwith eachother, andmostimportantly, how they
determinethedeliveredperformanceof acomputersystem[12, 32]. An architecturestudy
of cachearchitecturesis consequentlynecessaryandis givenin Section2.2.
Therehave beendecadesof researchefforts devoted to the optimal implementation
of matrix multiplication. Section2.3presentsseveralconventionalapproachesandrecent
6
7
approachesin this field anddiscussestheadvantagesanddisadvantagesof eachapproach
whereverappropriate.
2.1 Matrix Multiplication and Linear Algebra Libraries
2.1.1 BasicConceptsand Algorithms
Generally, thematrixmultiplicationoperationin scientificcomputingis in theform of
-/. $013254 - , where 0 and 4 areconstants, is a  687 matrix,  is a 7 6:9
matrixand
-
is a ;69 matrix. Densematriciesareassumedthroughoutthisthesis.Row
or columnmajorstorageis assumedfor matricesA, B, andC. MatricesA andB mayalso
be transposed.Figure2.1 shows anexampleof suchmatrix multiplicationwhen 0=<> ,
4?<@ ,  <@A , 9><3A and 7B<@A .
C D D C D E F D D F D E C D D F D D G C D E F E D C D D F D E G C D E F E E
C E D C E E F E D F E E C E D F D D G C E E F E D C E D F D E G C E E F E EH I
Figure2.1An Exampleof Matrix Multiplication
Strassen’s algorithmis not consideredin this thesisbecauseof its extra memoryre-
quirementsandpoordatalocality [27], althoughit hasa lower arithmeticcomplexity and




algorithmrepresentationfor matrix multiplication,which is ba-
sicallya triply nestedloopasshown in Figure2.2.
for(i = 0; i < M; i++) {
for(j = 0; j < N; j++) {
for(k = 0; k < K; k++) {





Otherformatsof matrix multiplication update,suchasSaxpy formulation,areavail-
ablebut arenot goodfrom thecomputationalpoint of view [27] andthusnot chosenfor
representingthematrixmultiplicationin thealgorithmlevel. Anotherreasonfor choosing
this
 
representationis to betterreflectthechangesin thealgorithmlevel whenadapting
theblock representationof matrixmultiplicationlater.
Thenumberof floating-pointoperations(flops)of matrixmultiplicationis in theorder
of AJ@9?7 , and @9+29?7+2@7 matrixentriesareinvolved.Thismeansthatasymptot-
ically thereare KMLONP floating-pointarithmeticoperationsperformedon KQMLSRT memory
operandsfor squarematrix multiplicationwith matrix size L . Thegooddatalocality and
high F:M ratio (thenumberof floating-pointoperationspermemoryoperation)[45] pro-
9
motesmatrix multiplication to be thepreferredoperationon high performancecomputer
architectures[27, 45].
2.1.2 Matrix Multiplication in the Linear Algebra Libraries
The optimal implementationof linear algebralibrariesgoesbackto the mid-1970s’
LINPACK library [21] and level-1 BasicLinear AlgebraSubprograms(BLAS) [39] on
vectorsupercomputers.Later, thelevel-2BLAS [23] werealsointroducedto obtainhigher
performancethan level-1 BLAS on vector supercomputers.With the advent of cache
memoryhierarchysystemin the late 1980s,level-3 BLAS [22] andLAPACK [13, 14]
werestandardizedto make full useof this advancedmemorysystem.
It is shown by theseefforts that thehigh performanceof matrix multiplicationcanbe
transformedto the high performanceof most linear algebraoperationson cachearchi-
tectures[36]. Also, the computationalrichnessof matrix multiplication providesmany
opportunitiesto achievenearpeakperformanceon cachearchitectures.
2.2 Ar chitecture Study




processors(e.g., Intel Pentium4 andAMD Athlon) havecopiedthebestattributesof RISC
10
processors,that is, employed a RISC-like strategy in their microcodedesignandimple-
mentation,to obtainthesamelevel of performanceasRISCmachines[11, 26, 34].
Basedon all thesefacts, the architecturesof interestin this study are the modern
RISC/CISCprocessorswith a deepmemoryhierarchy, which encapsulatethe interesting
architecturalfeaturesthat determinethe deliveredperformanceof matrix multiplication.
A goodexampleis the CompaqAlpha platform [19], which is also the testbedof this
thesis.Currentx86 architectures,however, will beusedasthe referencearchitecturesin
this architecturestudyastherearemoreliteratureinformationavailablefor this processor
family.
2.2.1 ProcessorMicr oarchitecture and The PeakPerformance
The modernprocessor’s microarchitectureis designedto deliver the higheston-chip
parallelism[32, 37]. Someimportantarchitecturalfeatureskeyed to the performanceof
matrixmultiplicationaresummarizedbelow:
Superscalarpipelining: By parallelingmultiple pipelineswith thehelpof hardwareand
software,mostmodernprocessorscanexecuteacombinationof severaloperationsin one
cycle. For example,on an AMD Athlon processor, up to nine OPs(the RISC-typein-




Out-of-Order execution: To be able to betterexploit the availablehardwareresources
andresolve the dependenciesthat may stall the pipelines,modernprocessorsoften im-
plementthe out-of-orderunit, which canre-orderthe instructionstreams,executethem
speculatively (out of the original programorder), andcommit themin the original pro-
gramorder.
Data Prefetching: A cost-efective solution for bridging the performanceof processor
andmemoryhierarchyis theuseof cache.However, thecacheonly fetchesdatawhenthe
processorexplicitly asksfor it. Mostprocessorsnowadayshavedataprefetchinghardware
thatoverlapthepossiblecachemisseswith theprocessorcomputation,increasingtheuti-
lization of memoryhierarchybandwidth,andhiding thememoryaccessinglatency being
seenby theprogram[44].
Data level parallelism: Exceptfor squeezingout on-chip instructionlevel parallelism,
modernprocessors eekto exploit data-level parallelism. A goodexampleis the SIMD
(Single Instruction Multiple data) unit being seenon current x86 architectures:Intel
SSE/SSE2andAMD 3DNow! [11, 34].
Table2.1shows theimportantcharacteristicsof currentx86 processormicroarchitec-
ture(adaptedfrom [11, 33,34]). More specificinformationcanbefoundin [9].
2.2.2 Memory Hierar chy and Delivered Performance
Thememoryhierarchyonmodernprocessorscanbeabstractedin apyramid,asshown
in Figure2.3.Theregistersareat thetopof thememoryhierarchy. They arealsotheclos-
12
Table2.1TheCharacteristicsof x86ProcessorMicroarchitecture
Feature PentiumIII Pentium4 Athlon
Core P6 NetBurst QuantiSpedd
Microarchitecture Microarchitecture Microarchitecture
PipelineLength 12 20 10(Int),15(FP)
ExecutionUnits 2 ALU/FPU 2 ALU/FPU 3 ALU/FPU/AGU
Registers 8 GPR 8 GPR 8 GPR
8 x87 8 x87 8 x87
8 SSE 8 SSE/SSE2 8 SSE
SinglePrecision x87: 1 x87: 1 x87: 2
Speedup SSE:4 SSE:4 3DNow!: 4
DoublePrecision x87: 1 x87: 1 x87: 2
Speedup SSE2:0 SSE2:2 SSE2:0
est storagedevice to the CPU core1. From the top down in this hierarchicalstructure,
L1 cache,L2 cache,TranslationLookaside-TableBuffer (TLB), mainmemory, anddisk
storagearethecommonstoragedevicesseenin modernarchitectures.Thefurtheraway it
is from theCPUcore,thecheaperthestoragedevice is andtheslower theaccessspeed.
Theimportantmetricsfor measuringtheperformanceof memoryhierarchyarethemem-
ory bandwidthand memorylatency of eachlevel of memoryhierarchy. The width of
therows betweenthelayersin Figure 2.3 illustratesqualitatively theamountof memory
bandwidthis availablein between,andthememorylatency hastheoppositetrendof the
memorybandwidth.
1It usuallytakesaboutonecycle to load the datafrom registersto executionunits,but this latency can
usuallybehiddenby pipelining.Thus,theregisterandCPUcoreareshown asonecomponentin Figure2.3.
13
U V W X Y Z [
\ ] U ^ X _ [
\ ` U ^ X _ [
a ^ b c a [ d Y Z e
f b g h i j Y Z ^ k [
Figure2.3ThePyramidof MemoryHierarchy
Theimportantmemoryhierarchyparametersfor currentx86architecturesaregivenin
Table2.2(adaptedfrom [9, 11,34]). Thedetailedmodelingof thememoryhierarchyand
thecorrespondingoptimizationstrategy will beofferedin ChapterIII.
2.3 Previous Matrix Multiplication Work
An importantlessonlearnedin thestudyof theunderlyingarchitectureis thatthepeak
performanceis only available at the top level of memorysystembecauseof the huge
memoryaccessoverheadassociatedwith accessingdatain thelower level of thememory
system.Thus,the fundamentaloptimizationstrategy for matrix multiplication is to keep
datain thetop level of thememorysystemandto reducethedatamovementbetweenthe
differentlevelsof thememoryhierarchywhile keepingthepipelineasbusyaspossible.
Dif ferentapproachesthat possessthis fundamentaloptimizationprinciple have been
developedto achieve the high-performanceimplementationof matrix multiplication on
14
Table2.2MemoryHierarchyParametersof Currentx86 Architecture
Feature PentiumIII Pentium4 Athlon
L1 32Bytes/Line 64Bytes/Line 64Bytes/Line
I-Cache 16 KB, 4-way 12K uops,8-way 64 KB, 2-way
D-Cache 16 KB, 4-way 8KB, 4-way Exclusivewith L2
L2 32Bytes/Line 128Bytes/Line 64Bytes/Line
512KB .. 2 MB 256/512KB 256KB
Cache 4-way 8-way 16-way
L1 to L2 0.5,1 6 1 6 1 6
Speed ClockSpeed Clock Speed ClockSpeed
L1 3 2/6 3
Latency cycles cycles cycles
L1 + L2 7 - 27 9 - 13 11 - 20
Latency cycles cycles cycles
variousarchitectures.This sectiongivesan overview of theseapproachesanddiscusses
their advantagesandlimitations.
2.3.1 Early Stageof the Blocking Algorithm
Theearlyexplorationof blockingalgorithmsfor thehierarchicalmemorysystemcan
befound in [25, 38, 41, 48]. Dif ferentstrategieswereemployedfor deriving theoptimal
blocking factors: Gallivan, et. al. [25] useda decouplingmethodologyto capturethe
importantsystemparametersandmodelthe blocking algorithm,Wolf andLam [38, 48]
improved the datalocality basedon the datareuseand loop transformationtheory, and
SchreiberandDongarra[41] studiedautomatictransformationandblockingfor nestloops.
15
Throughtheseearlystudiesof matrix multiplicationandblockingalgorithm,it is rec-
ognizedthatthehighperformanceof matrixmultiplicationcanbeachievedby combining
a high-level blockingstrategy with a carefully tunedinnerkernelfor computingthesub-
matrixproduct.However, theoptimalblockingfactorsaredeterminedby variousmachine
parameters.Becauseof their inherentcomplexity, the performanceoptimizationis rela-
tively laborintensivefor complex multi-level memories[46].
2.3.2 Self-Tuning Libraries
ThePHiPAC project[15] demonstratedthatautomaticturning is achievableby using
parameterizedcodegeneratorsfor producingmatrixmultiplicationcodesin highlevel lan-
guages,with theexpenseof severaldaysor weeksof optimizationprocess.Next, theAu-
tomaticallyTunedLinearAlgebraSoftware(ATLAS) project [3, 46] reducedthesearch
spacefor automatictuning by generatingandtargetingonly onespecifickernelroutine.
Othertradeofs arealsomadein ATLAS to furtherconstrainthenumberof differentblock-
ing implementationsbeingconsideredsothat thetotal tuningtime is shortenedto within
theorderof severalhours.
While the performanceof self-tuninglibraries is competitive with the vendorhand-
tunedBLAS routines,the empiricalphilosophybehindtheselibrariesandthe strategies
beingemployedin themhave thefollowing shortcomings:
Compiler dependency:Whaley, et.al. [46] statethatATLAS only requiresan“Adequate
ANSI C compiler” becausethe codegeneratorcanbe trainedto performthe complier’s
16
optimizationwork. However, ATLAS’s performanceoptimizationability is stronglytied
to thecompiler’s instructionschedulingability. For example,comparedwith gcc2 2.95.x,
ATLAS observesa10%to 50%performancedropwhenusinggcc3.0.Thereasonfor this
performancedegradationis thatgcc3.0schedulesthe loadoperationsslightly differently
thangcc2.95.x[1]. Hence,animplicit sensitivity to compilerversionsexists.
Assemblycodedependency:Whaley, et. al. [46] alsostatethat “ATLAS is written en-
tirely in ANSI/ISO C.” However, in thelatestversion(3.5.0)of ATLAS , assemblycoded
kernelswere usedon many platforms, including the Intel Pentium4 processor, AMD
Athlon processor, PowerPC/Altivec architecture,and SUN UltraSparcarchitecture[2].
Therearebasicallytwo reasonsfor this apparentcontradiction:
, ATLAS is not adaptive to the new architecturalfeaturesbecauseof the compiler
dependency. SuchexamplesincludetheSSE/SSE2optimizedkernelsfor Intel plat-
form, the3DNow! optimizedkernelsfor AMD Athlon, andthekernelsthat incor-
poratedthemachine-specificprefetchinginstructionsfor UltraSparcarchitecture.
, ATLAS’s kernelstrategy is not optimal. Thusit hasto incorporatebetterassembly
kernelsfor keepingthe performanceadvantages.Suchexamplesincludethe opti-
mizedassemblykernelsfor theAMD Athlon Processor[2].
Jaggedperformancecurve: ATLAS’sperformancecurveis notflat with respecto matrix
sizeandthisunderminesthelibrary’sperformancepredictability3.
Non-optimal performance: Therearetwo major reasonsfor the inability of ATLAS to
deliver theoptimalmatrixmultiplicationperformance.
2GNU CompilerCollection,see[6]
3The term predictability refers to “having a small upper bound on the variation in performance
(MFLOPS)with respecto matrixsize” [43].
17
, As alreadyclaimed,ATLAS’s kernelstrategy is notoptimal.
, As will beshown later, ATLAS’sblockingstrategy is notoptimaleither.




dependentvariablesfor theoptimizedperformance.Specifically, by useof a divide-and-
conqueralgorithmandan “ideal-cache”modelproposedin [16], Frigo, et. al. [24, 40]
provethatsuchacache-obliviousalgorithmachievesthesametheoreticalboundoncache
missesfor the sameamountof work load. Although currentlythereis no concretehigh
performanceimplementationof sucha cache-oblivious algorithmfor matrix multiplica-
tion, similar ideaswereusedin thehierarchicalmethodsfor pursuingtheoptimizedmatrix
multiplication [43, 47].
Noticethatthe“ideal-cache”modelthatcache-obliviousalgorithmsarebasedondoes
not incorporatesomeimportantarchitecturalfeaturesthat are key to achieving optimal
performance.The correspondingoptimizationtechniquesthat exploit thesefeaturesare
thenleft outof thescopeof thisasymptoticoptimalalgorithm:
, Cachemissescanbeoverlappedwith theprocessorcomputationalactivitieswith the
helpof out-of-orderexecution,dataprefetchingandcarefulsequencingof memory
operationsandfloatingpoint operations.
, TLB misseshave moredramaticeffectson performancethancachemisses.It re-
mainsquestionableasto whetherthe cache-oblivious algorithmis naturallyTLB-
oblivious.
18
For theoptimal implementationof thematrix multiplication on the modernarchitec-
tures,onehasto combinetherecursionstrategy of cache-obliviousalgorithmswith itera-
tion methodsthattakeadvantageof theabovementionedarchitecturalfeatures.Thisstudy
is built on andbenefitsfrom bothstrategies.
2.3.4 Hierar chical Data Structures
Thequadtreerepresentationof recursivematrixstoragewasusedbyFrensandWise[47]
with thedemandof upto 78%extramemory. Thisextramemoryrequirementmayor may
not beacceptableon virtual memorysystems,but is not acceptableon realmemorysys-
tems. The samerecursive algorithm is also employed by Chatterjee,et. al. [18] with
improvedstorageformat,but theperformancewassignificantlydegradedbecauseof the
computationsperformedon the paddingelementsfor handlingof arbitrarysizedmatri-
ces. Gustavsonalsoemployedpaddingin the recursive datastructureanddevelopedthe
correspondingrecursive algorithm,which producesgoodperformancebecausethecom-
putationsarecarefullydesignedto not performon thepaddedelements.The4D storage
formatthatis similar to whatChatterjeediscussedin [17] andtheusageof tablewerealso
suggestedanddescribedby Gustavson[31].
More recently, ValsalamandSkjellum[43] proposedanew hierarchicaldatastructure
for the storageof matrices. This hierarchicalstorageformulationhandlesarbitrarysize
matriceswhile avoidingtheuseof tables,extramemory, or computations.It alsoimproves
data locality by the utilization of Morton ordering. Benefitsof using this hierarchical
19
datastructurealsoinclude:high performance,flat performancecurves,avoidanceof data
copying cost,andadmissionof polyalgorithms4.
2.3.5 Variants of NewBlocking Approach
Somerecenttreatmentsof blockingproducehighperformanceonvariousarchitectures
andaresummarizedbelow:
Multiple-le vel Blocking: By benefitingfrom a simpletheoreticalmodelof hierarchical
memories,theITXGEMM project[30] yieldsimplementationsthatoutperformself-tuning
librariesandrivals. The theorydefinedby the ITXGEMM researchersuggeststhatdif-
ferentalgorithmsthataredependenton thematrixdimensionsateachlevel of thememory
hierarchymustbeemployedfor achieving maximalperformance.This theoreticalcorol-
lary is consistentwith the favor of the polyalgorithmapproachin [43]. The theoretical
formulaof ITXGEMM couldalsobeutilized by self-tuningmethodsto reducethesearch
space.
Optimization for TLB: It wasrecognizedthatthetimespentfor handlingTLB missescan
accountfor 40%to 80%of total run time for theworse-casescenarioonavirtual memory
system[35]. Theblockingalgorithmusedby Strazdins[42] thatexplicitly considersTLB
sizewhenchoosingtheblockingfactorsachievesgoodperformance.However, Strazdins’
work is doneon a physicallytaggedcachemachine(SunUltraSparc)thatemphasizesthe
4Thetermpolyalgorithmswasintroducedby ProfessorJohnRiceandrefersto “The choiceof onesuit-
ablealgorithmfrom a setof candidatealgorithms,all designedto solve thesameproblem,with thegoalof
obtainingthebestpossibleperformance.” [43]
20
utilizationof TLB, andthemachineclockcycle is relatively low (170MHz) sothattheop-
timizationwork is not challenging.Recently, GotoandvandeGeijn [28] proposedanew
optimizationstrategy that is drivenby theminimizationof TLB missesandtheir libraries
achievehighly competitiveperformanceon variouskindsof highperformancecomputing
platforms[7], includingIntel Pentium4, Itanium,andAMD Opteronprocessors.
2.4 Lessonslearned
Thisliteraturereview emphasizedseveralimportantpointsthatarekey to high-performance
implementationsof matrixmultiplication.They areasfollows:
1. Blocking is a fundamentaloptimizationstrategy for matrixmultiplication;
2. Themechanicalparametersearchingis usefulif thecomplexity of theoptimization
problemcannotbeclearlyidentified;
3. Therecursivestrategy canobtaingoodperformance.But it hasto beboundwith the
iterationmethodsthatcanexploit theunderlyingarchitecture;and
4. Not only thecachesystem,but alsotheTLB is aperformance-criticalconstraint.
CHAPTERIII
RESEARCHAPPROACH
The researchmethodologyof this thesisis to identify the architecturalperformance
bottlenecksthroughadvancedarchitecturestudyandmodeling,thendeterminethe opti-
mizationprioritiesandstrategiesaccordingly, andattackthesebottleneckswith thehelpof
a hierarchicaldatastructureandoptimizedkernelroutines.This chapterprovidesa qual-
itative analysisof different optimizationapproaches,describesthe researchframework
that is basedon thecombinationof therecursive orderingandaniterationalgorithm,and
presentstheintegrationandtuningstrategiesof this study.
3.1 TLB-Ob vious Ar chitectural Modeling
3.1.1 Review of Memory AccessProcess
Previousarchitecturalmodelsfor matrix multiplicationoftendismissedthepresence
of TLB misses,which have beenidentifiedas the performanceculprit of matrix multi-
plication by somerecentresearch[7, 28]. Thus,it is necessaryto examinethe practical
memoryaccessprocedureson modernarchitecturesandincorporatetheperformancerole
of theTLB into theglobalprocessingpicture.
21
22
l m n o p q r
s t u v w x l y o z r
s { l y o z r
s | l y o z r
} y ~  } r  p q 
 ~ q   y    q r  
 s 
m z   ~ o y    q r     q ~ r 
} r  p q  } y  y  r  r   n  ~ 
Figure3.1TheDiagramof MemoryAccessProcess
23
As shown in Figure3.1,datafetchingactivities aredonein parallelby boththeMMU
(memorymanagementunit) andthecachesystem.TheTLB dealswith memoryrequests
thatcannotbeaccomplishedin thecachesystem.TheTLB canbedeemedasa cachefor
the pagetable,which recordsthe mappingfrom the virtual addressto the physicalpage
addressandis usuallystoredin main memory. If anentry is found in theTLB, thenthe
virtual to physicaladdresstranslationcanbe donepromptly. However, if the datais not
found in the TLB, the TLB missrequiresinterrupthandling(for example,for software-
managedTLB, CompaqAlpha processorandIBM POWER processors;see[35]), which
in turn will flush the pipelines,fetch thepagetablefrom themain memory(at leasttwo
morememoryreferences),updatetheTLB entry, andloadthedesireddataentry. All these
activities contribute to long pipeline stalling time. The cachemisses,however, do not
necessarilystall thepipelines.Theprocessorcancontinueprocessingtheinstructionsthat
areindependentof thecache-missedentries.
3.1.2 Modeling of Blocking Algorithm
Blocking is a usefuloptimizationstrategy to improvedatalocality andreducetheac-
cessmissesresultingfrom thememoryhierarchy. Pseudocodefor the
 
variantblocking
algorithm is given in Figure 3.2. Notice that the loop order of
 
is often exchange-
ableandentailsmultiple levels of loopsthat may be employed for multiple-level cache
blocking. Usually thecopy processis alsoaccompaniedby possibletranspositionof the
24
submatrixfor optimalkernelperformance.Clean-upcodefor theblockingsizethatcannot
beevenlydividedby P, Q, andR areskippedfor thesimplicity of this pseudocode.
BasedonFigure3.1and 3.2,theperformanceof thisblockingalgorithmonthevirtual
memorysystemscanbedecomposedasfollows:
 <  2 J 2  $  2 ¡  (3.1)
where

is the overall executiontime of the blocking algorithm,

is the execution
time of thekernelroutines(which mayhave differentformatsandassumptionsof initial
positionsof matrix entries),
¢
is the pipelinestalling time becauseof TLB misses,
£
is the total time of memoryaccesses(which includesall levelsof cacheaccesses
andmain memoryaccessesbut not the TLB accesses),







to be overlappedwith the computationby prefetchingtechniques),and
¡ 
is the cost
of copying the matrix into the global buffer, which occursin most blocking algorithm
implementations.
Theblockingsizeis animportantfactorfor thedeliveredperformanceof matrix mul-
tiplication. Kernelperformance,measuredby

, will only reachnearpeakperformance
whenthematrix entriesarestoredin thetop level cacheof thememoryhierarchy. Since
thecachesizeis usuallylimited, theblockingsizebeingseenin theconventionalblock-
ing algorithmsis usuallyquite small, which meansthat morecopy overheads
¡ 
are











for(sj = 0; sj < N; sj += Q) {
for(sk = 0; sk < K; sk += R) {
/* Copy part of B to the global buffer SB */
COPY(SB, B, Q, R, IF_TRAN);
for(si = 0; si < M; si += P) {
/* Copy part of A to the global buffer SA */
COPY(SA, A, P, R, IF_TRAN);
/* Kernel computation, C <- SA’ * SB */















3.1.3 Blocking for TLB
Naturally, onewould wantto increasetheblockingsizeto reducethecopy overheads
¡ 
andtry to still keepthepipelinefull, thatis, keepthesumof  J 2 £ $  
equalto the original processingtime. Sustainingpeakperformancefor a big submatrix
demandsconsiderationof thefollowing threefactors[8]. Onerequirementis thatmemory
bandwidthfrom thelowerlevel of memoryhierarchyto theprocessorcorebewideenough
for streamingthedatain. Anotherrequirementis thatthememorylatency from thelower
level of memoryhierarchyto the processorcorehasto be well hiddenso that the time
for the cachemissescanbe overlappedwith the kernel computationtime. Prefetching
techniques,including both hardware prefetchinginstructionsand software prefetching,
are helpful with this respect. Thesetwo requirementswill minimize  £ $   as
muchaspossible.The third factor, which concernsthe TLB, comesfrom the following
factspresentedonmodernarchitectures:
1. TLB missesaremoreexpensive thanbothL1 andL2 cachemisses,
2. TLB missesdonotoccurif thememoryentryis foundin thecachesystem,
3. TheTLB usuallyis biggerthanL1 cachebut smallerthanor equalto L2 cachesize,
and
4. A few cachemissesmaybetolerated,but notTLB misses.
27
Componentsof the memory hierarchyusually follow a pyramid scheme(seeFig-
ure2.3).Thismeansthatthelargerandlower levelsof cachearemoreexpensiveto access
thanhigherelements,asillustratedin Table2.2.Basedon Item1 and2, theTLB couldbe
regardedasonemoreextra “virtual cache”betweenL2 cacheandthemainmemoryat the
algorithmiclevel. This is theview of theconventionalblockingalgorithm,which makes
no distinctionsbetweencachemissesandTLB misses,reducingthemwith thesameap-
proach.But, basedon Item 3, theTLB shouldresidebetweenL1 cacheandL2 cachein
thememoryhierarchy, notbelow theL2 cache,asindicatedby items1 and2. Thisconflict
anditem 4 naturallyleadsto the following rulesfor performanceoptimizationof matrix
multiplication:
, TheTLB hasastrongerlimitation for theblockingsizethanL2 cache,and
, TheTLB missesshouldbehandledin adifferentwayascachemisses.
In short,theTLB playsadecisiverole for in theperformanceof matrixmultiplication.
Specificallyoptimizingfor TLB will allow big blockingsizesto beused,which in turnre-
ducesthecopy overheads ignificantly. Thisconclusionis alreadyshown by [7, 28]. Their
implementationsthatexplicitly consideredtheeffectsof TLB achievedhighly competitive
performancefor row-majoredandcolumn-majoredmatrixmultiplication.
3.2 The Matrix Multiplication Framework
A combinationof recursionanditerationstrategy is employed asthe framework of
this thesis. The actualcomponentsinclude the hierarchicaldatastructure,the iterative
28
algorithm,andtheoptimizedkernelroutinesfor theAlpha processor. Thematrix is first
hierarchicallydecomposedto submatrices.Recursive Morton Orderingis then applied
in onehierarchyto organizethesubmatricesfor bettertwo-dimensionaldatalocality and
betterexploitation of the memoryhierarchy. It is on suchsubmatricesthat an iterative
algorithm is carriedout for computingthe inner product,during which specialconsid-
erationis given to optimizationof the processorcomputationresource,andthe possible
overlappingof memoryaccessactivitiesandfloating-pointcomputationactivities.
3.2.1 Hierar chical Data Structure
Thehierarchicaldatastructureof this thesisis inheritedfrom variant1 of HERO [43],
which standsfor “hierarchicalextensionof recursive ordering.” This hierarchicalstorage
format is carefully designedto utilize Morton Orderingto improve datalocality in two
dimensionsand handlematricesof arbitrary size by taking advantageof the following
fact: any integercanbeexpressedasasumof powersof two.
Four levelsof divisionanddecompositionarerequiredto transformtheoriginalmatrix
into thenew hierarchicallystoredmatrix,asshown in Figure3.3.Paddingis first appliedto
ensurethatthematrixsizeis evenlydividedby theblockingsize,whichwill bedetermined
laterby consideringthecache,TLB, andpipelineoptimization.
Level 1 and2 decompositionaccomplishestherecursionprocess.In thefirst level, the
paddedmatrix is dividedinto blocks.Thematrixentriesinsidetheseindividualblocksare






























































































	 Matrix (Adoptedfrom [43])
30
is composedby the level 1 blocks. This smallmatrix is easilydecomposedto a seriesof
level-2 matrices,whosesizearepower of 2 andarrangedsuchthat thebiggestonestays
in the right-bottomcornerandthe smallestonestaysin the left-top corner. Insidethese
level-2matrices,thelevel-1blocksarerecursively organizedby Morton Ordering.
Thethird andfourth level areonly designedto uniquelyspecifytheorderof thelevel-2
matrices.Thelevel-2 matricesaretiled in thethird level andlevel-3 tiles arearrangedin
column-majororder(or row-majororder)at thefourth level.
3.2.2 Iterati veAlgorithm
Althoughdifferentalgorithmsmaybeemployedfor thishierarchicalstorageformatto
exploit theadvantagesof thepolyalgorithmicapproach,it is suggestedby severalprevious




variantiterative algorithmfor themultiplicationof the re-
cursively storedmatrix. Thepseudocodeof this algorithmis shown in Figure3.4. Notice
that the indexing for the third and the fourth level is trivial and is thusnot given. The
Morton-Orderedaddresscalculationfor the first andthe secondlevel is rathercomplex.
An algorithmthat is basedon fastintegerdilation methodis employedherefor reducing
suchindexing overheads[43].
Theperformanceof suchaniterativealgorithmcanbemodeledas:
¥¤ <  2 %J 2 £ $  2 ¤£¦*§¨%© (3.2)
31
for(Each Level-3 Tile) {
for(Each Level-2 Square Matrix) { /* Size is b*b */
/* Find the first matrix entry of each Level-2 Matrix */
FIND(SA2, SB2, SC2);
for(si = 0; si < b; si++) {
for(sj = 0; sj < b; sj++) {
/* Find the first entry of each Level-1 C Block */
SC1 = C + (SC2 + INDEX_MOR_ORI(si, sj)) * P * Q;
for(sk = 0; sk < b; sk++) {
/* Find the first entry of each Level-1 A and B Blocks */
SA1 = A + (SA2 + INDEX_MOR_ORI(si, sk)) * P * R;
SB1 = B + (SB2 + INDEX_MOR_ORI(sk, sj)) * R * Q;
/* Kernel computation, C <- A * B */
Kernel_MM(SC1, SA1, SB1, P, Q, R);










ing thecorrectMorton-Orderedindexesof matrix entriesbasedon their original column-
majororderindexes,andtheotherfactorspossessthesamemeaningasin Equation3.1.
Inspectionof Equations3.1 and 3.2 implies that thecombinationof thehierarchical
datastructureandtheiterativealgorithmleadsto thefollowing:
1. Good two-dimensionaldatalocality is achieved suchthat the performanceof the
memoryhierarchyis automaticallyoptimized;
2. Thetimeconsumingcopy costassociatedwith blockingalgorithmis avoided;
3. The Morton Orderindexing is doneby integeroperations,which arecheaperthan
memoryoperationsinvolvedin thecopy process;
4. Thefine-grainedcontrolover theprocessorpipelineis achievablewith the iterative
algorithm;and,
5. It is possibleto reassemblethe indexing process(integeroperations)andthecom-
putationprocess(floatingpointoperations)in aproperwaysuchthatthesuperscalar
architectureis exploited.
3.2.3 Optimized Kernel Strategy for Alpha
Until now, this thesishasonly discussedhigh-level modeling,the hierarchicaldata
structure,andalgorithms.This sectionpresentsthelowestlevel kerneldesignstrategy as
well asthespecificoptimizationdetailsfor thechosenAlpha 21264processor.
It wasidentifiedby ValsalamandSkjellum [43] that the “f at” interfaceof DGEMM
-/. $ª0«¬254 - is not suitablefor thekerneloptimization.First, this interfaceis de-
signedfor generalpurposematrix multiplicationandinvolvessomeparametersthatmay
not benecessaryfor kerneloperation,suchastheaccumulationparameter4 andthescal-
ing factor 0 . Also, thekernelperformanceis indeedsensitiveto thevariousparametersof
33
thekernelroutine,includingtheblockingsizefor  ,  , and - andthe 0 and 4 parameters
aswell as the loop unrolling factors. The optimizedcodefor onecasemay be largely
suboptimalfor anothercase.If toomany parametersareinvolvedin thekernelroutineand
onecomposesdifferentkernelsubroutinesfor eachpossiblecase,thecombinationof these
parameterswould leadto codeexplosionin thekernelcode.Evenself-tuninglibrariesthat
train thecomputerto generatetheC codefor differentkernelcaseswould needto make
somecompromisesothatonly asubsetof all possiblecasesaregenerated.
Therefore,this thesisadvocatesasimplifiedversionof theDGEMM interface:
-­. $
8	O . Thehandlingof 0 of 4 canbeaccomplishedby complementarysubroutines.Also,
insteadof codinga kerneltoward a fixed blocking sizeandunrolling factors,the kernel
C interfaceis designedto work with differentblocking sizesandunrolling factors. The
performanceof sucha kernelstrategy may not be optimal. However, aswill be shown
later, betteroverall performanceis still achievablewith the help of a memory-friendly
matrixstorageformatandyieldsthefollowingadvantagesfromusingasimpleyetadaptive
interface:
, flexible tuningandintegrationin thelaterphase,
, lessdevelopmentcostbecauseof thereducedkernelcases,and
, only ® slowerkernelperformance.
Theoptimizedkernelof this thesishasbeenprovidedby Mr. KazushigeGotoat Uni-
versity of Texasat Austin andJPTO [8] andis completelycodedin Alpha assemblyin-
structions.Themainadvantageof usingassemblyinstructionsotherthanhigh level pro-
34
gramminglanguagesis the delicatehandlingof the processorpipelineandcomputation
resources,which facilitatesthefollowing techniqueskeyedto optimizedperformance:
Flexible ResourceAllocation
Thecomputationresourcesavailableon theAlpha processorsarethe32 integerregis-
tersfor integer/load/storeinstructionsandthe32 floating-pointregisters.Directly manip-
ulatingthemallowsbetterimplementationof mostinnerkernelsandavoidsrunningoutof




two integerinstructions(includingbothloadandsave instructions)andtwo floating-point
instructionsin thesamecycle. Therefore,groupingtheseinstructionsthatcanbeexecuted
together(softwarepipelining)will explicitly exposethe instructionlevel parallelismand
boostperformance.Althoughtheout-of-orderunit on theAlpha processorwill helpwith
this respect,it is still beneficialto follow thesecoding principles. In Figure 3.5, four
instructions(two load instructions,one floating point add instructionand one floating-
point multiply instruction)canbeexecutedin onecycle andthusgroupedtogetherin the
kernelroutines.
Suchcodegroupsare largely usedin the kernel assemblycode. Sometimeswhen
resourceconflictsareexpected,the“unop” instructionis insertedto maintainthis coding
patternwhile waiting for theresourceto beavailableagain,asshown in Figure3.6
35
addt $f8, $f16, $f8
mult $f23, $24, $16
lda $3,  SIZE($3) # $3 = *($3 + SIZE)
# $f16 = $f23 * $f24
ldt $f23,  SIZE($4) $f23 = *($4 + SIZE)#
# $f8 = $f8 + $f16 
Figure3.5SampleAssemblyCodeI of SoftwarePipelining(Adoptedfrom [8])
addt $f1, $17, $f1
unop
mult $f20, $f25, $f17
# $f1 = $f1 + $f17
ldt $f25,  SIZE($6)
# 
# $f17 = $f20 * $f25
# $f25 = *($6 + SIZE)
Figure3.6SampleAssemblyCodeII of SoftwarePipelining(Adoptedfrom [8])
Overlappingof MemoryAccessandComputation
The ideaof increasingthe blocking size to reduceoverheadwill fail if peakperfor-
mancecannotbe preserved for big blocks. The orderof memoryaccessoperations,in-
cluding both load andsave instructions,thereforeneedsto be scheduledvery carefully
alongwith thefloatingpoint operationsto hidethehighermemorylatency implied by the
biggerblockingsize.Prefetchingtechniquesarealsohelpful for keepingthepipelinefull.
Theprefetchingdistancedependsonthelatency of thememoryhierarchy, which is given
in Table5.2. If a matrix entry is to beusedat cycle ¯ andit is estimatedto bestoredin
the
 
level of memoryhierarchy, thenit would needto be loadedat cycle !°¯±$5 ³²µ´¶   (' ,
whereL(i) is theaccesslatency from level
 
to level
  $5 of thememoryhierarchy. This
36
canbedoneby eitherthehardwareprefetchinginstructionsor anexplicit arrangementand
schedulingin theassemblylanguagelevel.
Of course,theability to prefetchdataentriesis not infinite, andmemorybandwidthis
alwayslimited no matterhow cleverly oneschedulesthe instructions.Thekernelperfor-
manceandblocking sizethat allows oneto sustainpeakperformanceis actuallylargely
dependenton the programmer’s skill. Fortunately, the kernelcreatedby Mr. Kazushige
Gotoobtainedaround96%of peakperformanceon a Alpha 21264processorin measure-
mentsdoneaspart of this thesiswork. For fairly big blocksthat arelarger thanthe L1
cacheof theAlpha processor, this kernelstill reachedabout93%of peakperformance.
CHAPTERIV
PERFORMANCEINTEGRATION, TUNING, AND BENCHMARKING
The performanceintegration, tuning, andbenchmarkingstrategiesof this thesisare
aimedatbalancingmemoryaccessandkernelcomputationsfor betteroverallmatrixmul-
tiplication performance.They areinterconnectedwith eachotherandintroducedtogether
in thischapter. Thefollowing sectionexplainstheintegrationandtuningstrategy, which is
employedat thefinal phaseof thisstudyto put theframework together. In orderto ensure
the fair comparisonof differentoptimizationmethodsandvalidationof the hypothesis,
the performancemetricshave to be well definedfrom the perspective of the application
users.Thesemetricsarespecifiedin Section4.2 anddescribedalongwith the practical
measurementproceduresfor obtainingtheir trustworthyvalues.
4.1 Integration and Tuning Strategy
Theintegrationof this matrix multiplicationframework is relatively straightforward.
Thehierarchicalstorageformatthatthisstudyis basedongrantsaflexible polyalgorithmic
approachfor matrix multiplication. In turn, theselectionof theiterativealgorithmallows
thedelicatehandlingof processoresources.Also, thereducedkernelinterfaceenablesthe
seamlessembodimentof optimizedassemblykernels.However, optimal performanceis
37
38
notobtainedby simplyputtingtheoptimalcomponentstogether. A carefultuningstrategy
is requiredto reachthepeakperformanceof theprocessor.
The basictuning strategy involvesmechanicalparametersearching,which is similar
to the ideabehindtheself-tuningstrategy, exceptthat the tuningof this studyis doneat
codingtime insteadof run time. Thereducedkernelinterfacealsohelpsheresincethere
areonly a few factorsthatneedto bespecificallyconsidered:theblockingsize · , ¸ , and
¹
thatcorrespondto theoriginalmatrix size  , 9 , and 7 , respectively.
Two tuning stagesare employed to ensurebetteroverall performanceachievement.
Thefirst stagetunesthekernelperformance.In thisstage,therangeof · , ¸ , and ¹ canbe
estimatedby performancemodelingasdescribedin Section3.1.Thenaspecialbenchmark
programis createdto measurethekernelperformancefor thisrangeof · , ¸ and ¹ , within
which the bestcaseswerepicked. In the secondstage,the global matrix multiplication
performanceis usedasthe performanceevaluationcriteria. Anotherspeciallydesigned
benchmarkprogramis appliedto monitor theoverall performancefor matrix multiplica-
tion with size  <º9;<­7»<¼JJ . Theblockingsizesin thesecondsearchingprocess
weresetupto betheonesfoundin theprevioustuningphase.
An importantperformancefeatureof the hierarchicaldatastructureis the extremely
flat performancecurve. Theoverall performanceflatnessis not only presentedfor differ-
entmatrixsizes,but alsoshownfor differentblockingsize.Thisattractivefeatureindicates
that the tuningprocedurewill choosetheoptimalblockingsizefor betteroverall perfor-
mancewithout beingaffectedby theconcretetuningparameters,suchasthematrix size
39
JJ usedin thefinal tuningstage.Figure4.1 illustratessucha performanceflatnessre-
gardingto theblockingsize · . In laterchapters,theflatnessof theperformancecurvewill
alsobemathematicallydefined,measured,andcompared.
½ ¾ ¿ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö ×
Ø Ù Ú Û Ü Ý Þ ß à á
âã ä å
æ ç èé ê ë
ì í î
ï ð ñò ó ô
õ ö ÷ø ù ú
û ü ý
þ ÿ           	 
  







) * + ,
- . / 0
1 2 3 4 5
6 7 8 9 :
; < = > ?
Figure4.1PerformanceFlatnessRegardingto TheBlockingSize
4.2 UserLevel PerformanceMetrics
A basicprincipleof determiningthepropermetricsfor performancemeasurementis
that theperformanceof eacharchitecturalcomponentis not anaccurateindicationof the
overallperformance.Sinceeachapproachhasmadetheirown tradeofs betweendifferent
40
performancecomponentsin Equation 3.1 and 3.2, comparingthe performancecompo-
nentsalone,(e.g.,
£
will meanlittle for theglobalperformance),which is whatend
usersdesire.Thus,only theperformancethat is concernedby theendusersis measured,





In-cacheMFLOPS KernelPerformance 7 O< RA@
¦*CB#(EDGFIHCJLK CB#(EDNM PO
Out-cacheMFLOPS GlobalPerformance  O< RA@
¦*QFRHSJ
Degreeof Flatness GlobalPerformance *<UTWVPX Y @
FRHSJRZ\[












To ensureaccuratemeasurementof the in-cacheperformance,it is assumedthat all
thematrixentriesinvolvedin suchkernelroutinesarealreadystoredin cache.Practically,
this canbe doneby repeatingthe kerneloperationsseveral timeson onespecificsetof
matricesuntil all thematrixentriesarefetchedinto theL1 cache.Thenthetiming resultsof
matrixmultiplicationkernelroutine  areaveragedandthemeanvalue   is obtained.
However, sincethe in-cachecomputationis quite fast and the matrix size is relatively
small, thestandardC timing utility maynot besufficiently precise.A little bit assembly
instructionthatcanmeasurethecycles,
-a`Ebdc   , passedduringtheexecutionwill better
suit therequiredaccuracy.
Notice that thereis no standardformat andsizespecificationfor the kernel routine.
Dif ferentapproacheshave their own choicesof theseparametersfor the optimal perfor-
mance.For thefair comparisonof differentapproaches,it is necessaryto employ ametric
thateliminatesthe impactsof differentproblemsize. MFLOPSis suitablefor suchmea-
surement,andtheIn-cacheperformanceof approachA is thendefinedasfollows:
7 O< AJ@9?7-a`Ebdc   	 -e`Ebfc  Lhgjikiml (4.1)
4.2.2 Out-of-CachePerformanceof Matrix Multiplication
An importantmetric for evaluatingthe the actuallydeliveredperformanceof matrix
multiplicationis thenumberof out-of-cacheMFLOPSbeingobtained.Therearetwo ma-
jor differenceswhenmeasuringthe out-of-cacheMFLOPSand the in-cacheMFLOPS.
42
Onedifferenceis the initial positionof the matrix entries. While the matrix entriesare
assumedalreadyin cachein thecaseof measuringin-cacheperformance,measuringout-
of-cacheperformanceonly assumesthat the matrix entriesarestoredin main memory.
Anotherdifferenceis that the rangeof interestedmatrix sizefor two experimentsis se-
lectednot to overlapwith eachother. Sincethe matrix sizefor out-of-cacheproblemis
usuallyrelatively large, the standardC timing routing canbe usedto obtainthe average
executiontime
  of approachA.
Theout-of-cacheMFLOPScanbecalculatedby:
onpn 9qn¡7:O< AJ@9 7 enpn¡9rn   (4.2)
where
ts ! A¢¢unwv¢J¢ ' . Theminimumvalue A¢J is calculatedon thesizeof L1 cache,and
vxn¡JJ is calculatedon thesizeof mainmemory.
Degreeof flatnessis anotherimportantmetric for evaluatingtheoverall performance
of matrix multiplication. It simplifiesthetuningprocedureandprovidestheperformance
predictabilitythatis critical for realtimecomputingandparallelcomputing.
Given the out-of-cacheMFLOPS onyn¡9qn¡7: of approach , thedegreeof flat-
nessof approachA is definedasfollows:
*< zt{}| !#onpn 9qn¡7: '%$ z~L !#enpn¡9rn¡7:('z{f| !  on¡pn¡9qn 7:(' (4.3)
Theperformanceof approachA is aggregatelyhigherthanB if andonly if:
²
D
 !#enpn¡9rn¡7: $ Mtnpn¡9rn¡7:('«)= (4.4)
43
Thefollowing formulasdefinetheperformancecriteriafor thesquarematrixmultipli-
cationwith size L thatthis thesiswill experiment:
 =Nn whereS< z{f| !#onL ('%$ z~L !#enL ('zt{}| !#enL (' (4.5)M !#enL 1$ MtnL ('«)+ (4.6)
NoticethatneitherF(A) nor M(A) is meantto beanaccurateindicationof betterper-
formanceby itself. However, the combinationof generallylarger MFLOPSandsmaller
F(A) evidently shows the betterperformanceof oneapproachover anotherone. Also,
thesemetricsarenotmeantto beusedfor comparingthekernelperformancebecausethat






the next sectionis specificallydesignedsuchthat an objective andpreciseperformance
measurementof variousapproachesis achievable. The performancemetricsdefinedin
Section4.2areemployedfor ensuringunbiasedperformanceevaluation.Thebenchmark
resultsarethenpresentedin Section5.2and5.3.Theperformanceanalysisaregivenwher-
everappropriateto justify thecorrectnessof thehypothesis.Then,certainamendmentsof
thehypothesisaremade.
5.1 Notations and Experimental Configuration
This thesisstudiesfivedifferentimplementationsof matrix multiplication.TheTLB-
blocking algorithm(GOTO library) andthe hierarchicalmethods(including the original
HERO andtheoptimizedHPCL1 implementation)areimplementedby theauthorandthe
author’scollaborators.Referencelibrariesincludetheself-tuninglibrary (ATLAS) andthe
1HPCL is specificallythelabelfor thenew work of this thesis
44
45
vendorlibrary (CXML). For the sake of convenience,the framework of the approaches
beingstudied,theirnameandconcreteversion,aswell asthelabelthatwill beusedin the
performancetableandgraphsareall listedin Table5.1.
Table5.1Notationsof Dif ferentMethods
Framework Library NameandVersion Label
Blocking for TLB Goto Triangle
HierarchicalDS(DataStructure) HERO Multiply
HierarchicalDS+ OptimizedKernel HPCL Diamond
Conventionalblockingalgorithm CXML 5.1.0andCXML 5.2.0 Circle
Self-TuningMethodology ATLAS 3.2.1andATLAS 3.4.1 SquareandPlus
Basedon the reasonsspecifiedin Section2.2, this thesischoosesthe Alpha 21264
processorbasedplatformsto verify theproposedhypothesis.Both Linux andTru64Unix
basedsystemsareusedto setasidethepossibleperformanceinterferenceproducedby the
operatingsystem.Thecompilersusedareusuallythehighestversionsof GCCtools that
areavailablefor thatparticularmachine.Oneof thereferencelibraries,ATLAS, hasthe
compilerdependency, sothatanold versionof GCCneedsto beusedto obtainthehighest





Machine CompaqAlphaDS20Server HP AlphaServerSCV2.5UK1 system
Processor Alpha21264/EV6 Alpha21264A/EV67
Clock Rate 500MHz 667MHz
PipelineLength 7(Int), 9(FP) 7(Int), 9(FP)
L1 cachesize 64 Kbytes 64 Kbytes
L1 latency 2 cycles 2 cycles
L2 cache 4 Mbytes 8 Mbytes
L2 latency 8 cycles 8 cycles
TLB size 128Entries 128Entries
TLB misses Severalhundredscycles Severalhundredscycles
Pagesize 8 Kbytes 8 Kbytes
Main Memory 768Mbytes 4 Gbytes
Memorylatency About150cycles About150cycles
FPSpeedup Multiply by 2 Multiply by 2
PeakMFLOPS 1000 1333
Compiler GNU GCCand GNU GCCand
Used CompaqC Compiler CompaqC compiler





pects,areshown in Table5.2. Notice that from theperspective of computerarchitecture,
the Alpha EV6 andEV67 processorsbeingusedbelongto the sameprocessorgenera-
tion [20]. Theonly differencethatmayaffect theperformancebehavior is thedifference
of theclockspeed,whichcanbeeliminatedby employing theproperperformancemetric.
This way it is madesurethat the performancepropertiesbeingobserved comespurely
from the softwarelayers,not the hardwareside. Unlessexplicitly specified,the default
platform is the CompaqAlpha DS20server with Linux kernelversion2.2.14-6.0. The
Linux kernel2.4.9-32.5smpis only beingemployedbecausetheCXML wasnotavailable
on the default platform. Anotherdefault settingis that only the performanceof double
precisionsquarematrixmultiplication(of size  ) will bestudiedunlessotherwisestated.
5.2 Performanceof Inner Kernels
Themeasuredkernelperformanceof HPCL andATLAS arepresentedandcompared
in this section.Otherapproaches’kernelperformancearenot benchmarkedbecausethat
they areusingeitherthesimilarkernelwith HPCL(e.g., CXML alsousesthekernelmade
by Mr. KazushigeGoto [4]), or a much slower suboptimalkernel (as the caseof the
original hierarchicalframework).
48
5.2.1 PeakKernel vs AggregateKernel
First, therelationshipof thekernelperformanceandtheglobalperformanceis evalu-
ated.Thekernelof HPCL reachesthemaximalperformance, 97%of peakperformance,
when } ,  , and k (hereinaftercalledthe “PeakKernel”). However,
theblockingsizefor theoptimaloverall performanceis } ,  , and Uk
(hereinaftercalledthe“AggregateKernel”). Figure5.1 reportsthe in-cacheperformance
of the PeakKernel and the AggregateKernel and the out-of-cacheperformancebeing
obtainedby usingPeakKernelandAggregateKernelrespectively.
The AggregateKernelcanonly reach93% of peakperformance,which is about4%
slower than the performanceof the PeakKernel. However, the overall performanceof
using the AggregateKernel outperformsthe performanceof using the PeakKernel on
almosteachsignificantlylargematrix size. Thesetwo figuresshowthat maximumkernel
performanceis not a guaranteeof theoptimalglobal performance.
5.2.2 HPCL vsATLAS
This sectionshows a comparisonof ATLAS’s kernelperformanceandHPCL’s ker-
nelperformance.A coincidencebeingobservedis thatATLAS alsoreachesthemaximum
kernelperformanceatthesamesizeof HPCL’sPeakKernel.SinceATLAS generatesthou-
sandsof innerkernels,oneonly needsto comparewith its bestkernelperformance.Sucha
performancecomparisonis presentedin Figure5.2.TheHPCLAggregateKernelis about
3% higherthanATLAS’s fastestkernel,which runsaround90%of peakperformance.
49
      ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ® ¯ ° ± ² ³ ´ µ ¶











è é ê ë
ì í î ï ð ñ ò ó ô








           
      ! " # $ % & ' ( )
* + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D











x y z {
| } ~                      









± ² ³ ´ µ ¶ · ¸ ¹ º
» ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ Ç È É
Figure5.1Performanceof PeakKernelandAggregateKernel
50











   
   	 
          








1 2 3 4 5





reasonsfor this performanceadvantage.Oneis theassemblycodingthatallows thedirect
manipulationandexploitationof theprocessoresources.Anotherreasonis the reduced
kernelinterfacethatavoidssomeunnecessaryparameters(andpossibleoverheads)of the
kernelroutine.
5.3 Performanceof matrix multiplication
Thesectionoffersadetailedperformancecomparisonof all five implementationsthat
includeGOTO, ATLAS, CXML, HERO, andHPCL (seeTable5.1 for moreinformation
on theselibraries). The out-of-cacheperformanceof theseapproacheson both Tru64
Unix andLinux basedplatformsarerelated,compared,andanalyzedin Section5.3.1and
Section5.3.2.
5.3.1 Explicit TLB blocking vs Implicit Memory Optimization
Theout-of-cacheperformanceof GOTO library thatexplicitly blocksfor thesizeof
TLB andHPCL implementationthat implicitly optimizedfor bothcacheandTLB layers
is presentedin Figure5.3
Theseperformancecurves exhibit the following two propertiesthat all promotea
polyalgorithmapproachfor theoptimizedimplementationof matrixmultiplication.:
52









j k l m n o p q r s t u v w x y z { | }








    
¡ ¢ £ ¤
Figure5.3TheOut-of-CachePerformanceof GOTO andHPCL
53
Flat Performanceof Hierar chical Framework: Theperformanceof HPCL reachesthe
peak(about870MFLOPS)for arelativelysmallmatrixsizeandthenstaysonaflat plateau
for arbitrarylargematrix size.In contrast,theGOTO library doesnot reachthepeakuntil
amatrixsizeof around¥§¦©¨ª¨ª¨ , andits performancewavedaroundthepeakperformancefor
certainmatrixsizes,especiallywhenmatrixsizeisnearapower-of-two integer. Obviously,
the utilization of the hierarchicaldatastructuremakes the HPCL implementationmore
favorableto theoptimizationof thememoryhierarchy.
PerformanceSwitching Point: Onaverage,theperformanceof HPCLimplementationis
higherthanGOTO library for matrix sizefrom ¥«¨ª¨ to around¥ª¦©¨§¨ª¨ , andtheperformance
of GOTO library is higherthanHPCL implementationfor matrix sizefrom around¥ª¦¬¨ª¨ª¨
to ­®¦©¨§¨ª¨ . This is becausethat the memoryaccessoperationis usually several orders
slowerthanthefloatingpointoperations.Whenthematrixsizeis small,thecopy costthat
hasa complexity of ¯±°³²µ´·¶ will still be considerablylarger thanthe computationalcost
thathasa higherordercomplexity of ¯±°³²µ¸¹¶ . Thusthehierarchalframework thatavoids
the copy costwill have a performanceadvantage.When the matrix size increasesto a
relatively largesizethatthememorycopy costcanbewell amortizedoverthecomputation






studied. The performancemetricsdefinedin Section4.2.2aremeasuredandthe perfor-
manceresultsaregiven.Basedon theseresults,thehypothesisof this thesisis validated.
ThemeasuredDegreeof FlatnessandAggregateMFLOPSareprovidedin Table5.3.
Thesevaluesarecalculatedbasedon thecorrespondingdefinitionfor matrix sizeranging
from ¥§¦©¨ª¨ª¨ to º®¦©¨ª¨§¨ . Theselectionof areais because,excepttheHPCL implementation,
otherlibrariesdo not reachthepeakperformanceuntil thematrix sizeincreasesto ¥ª¦¬¨ª¨ª¨ .
Also, it is moreinterestingto studythemultiplicationof big matricesfrom boththeoretical




Approach HERO HPCL GOTO ATLAS3.2.1 ATLAS3.4.1 CXML
EV6-Linux 2.7 2.6 4.4 59.4 14.8 6.2
Degreeof Flatness % % % % % %
EV67-Tru64 1.8 2.4 2.7 27 6.8 3.7
Degreeof Flatness % % % % % %
EV6-Linux 36.08 45.47 46.07 37.33 42.86 44.38
AggregateMFLOPS »¼¥«¨ª¨§¨ »¼¥«¨ª¨ª¨ »½¥«¨§¨ª¨ »¼¥«¨ª¨ª¨ »¼¥«¨ª¨ª¨ »¼¥«¨ª¨ª¨
EV67-Tru64 50.58 62.77 63.42 57.37 60.94 61.24




betterperformancethantheself-tuninglibrary (ATLAS) andthevendorlibrary (CXML).
NoticethatHERO implementationhasthebestdegreeof flatness,however, theaggregate
performanceof HERO is not goodbecauseit builds thekernelusingC-languagemacros.
HPCLimplementationintegratesamoreoptimizedkernelinto thehierarchicalframework,
andachievesbothhigherandflatterperformancethanall thereferencelibraries.
A completeperformancecomparisonof all five approachedon both Tru64Unix and
Linux systemareshown in Figure5.4 throughFigure5.7. Thematrix multiplicationare
benchmarked for size from ­§¨ª¨ to º®¦©¨§¨ª¨ as well as the power-of-two sizes,which are
specificallyselectedbecausethattheperformanceof matrixmultiplicationis known to be
sensitive for suchmatrix sizeor leadingdimensions.
Thefiguresexplicitly showsthatthehierarchicalandtheTLB-blockingbasedmethods
beatthe “state-of-art” libraries(ATLAS andCXML) on all the platformsbeingstudied.
TheGOTO library is alwaysthebestfor big matrix size,expectHPCL outperformsit for
thematrix sizelessthan ¥ª¦©¨ª¨§¨ . TheHPCL library is thesecondbestimplementation.It
hasaperformanceadvantageoverbothATLAS andCXML until thematrixsizeincreases
to around ¾¿¦©¨ª¨ª¨ , after that the performanceof HPCL is still competitive with ATLAS
andCXML. A performancedropnear ¾¿¦©¨ªÀªÁ is visible on all theperformancecurvesex-
cept the hierarchicalformulations,including both HPCL andHERO. By examiningthe
power-of-two performancegraphs,it is moreclearthat the blocking basedimplementa-
56
tions, includingGOTO, ATLAS, andCXML, have to suffer at nearpower-of-two matrix
sizesbecauseof thecachethrashingeffects. Theusageof hierarchicalwork avoidssuch
performanceanomaliesandprovidesthegoodperformancepredictabilityto theendusers.
Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø
Ù Ú Û
Ü Ý Þ
ß à á â
ã ä å æ
ç è é ê
ë ì í î
ï ð ñ ò
ó ô õ ö
÷ ø ù ú
û ü ý þ ÿ           	 
         








4 5 6 7
8 9 : ; < = > ? @ A
B C D E F G H I J K
L M N O
P Q R S T U V W X
Y Z [ \
Figure5.4Performanceof VariousApproacheson EV67-Tru64
57







   
   
   
   
      









Í Î Ï Ð
Ñ Ò Ó Ô Õ Ö × Ø Ù Ú
Û Ü Ý Þ ß à á â ã ä
å æ ç è
é ê ë ì í î ï ð ñ
ò ó ô õ
Figure5.5Powerof 2 Performanceof VariousApproachesonEV67-Tru64
58
ö ÷ ø ù ú û ü ý þ ÿ           	 









& ' ( )
* + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B





^ _` a b c d ef g h i j k l m n o
p q r s t u v w x y
z { | }
~        
   
Figure5.6Performanceof VariousApproachesonEV6-Linux
59







· ¸ ¹ º
» ¼ ½ ¾ ¿ À Á








ï ð ñ ò
ó ô õ ö ÷ ø ù ú û ü
ý þ ÿ        
  	 

        
   





mancethanthestate-of-artimplementationsis offeredby therelatively simplestrategies
thatarerefinedandsynthesizedfrom severalrecentresearchapproaches.Thiswasdemon-
stratedby thepracticalbenchmarkexperimentson a modernprocessorfamily with well
definedperformancemetrics.
This thesisrecognizedthat thearchitecturalconstraintsandissuesrestrictthecritical
pathandoptionsavailablefor optimalperformance.Specifically, TLB misses,inter-cache
peakbandwidth,and processorcomputationresourcesevidently reducethe numberof
choicesavailableto achieve theoptimalperformance,This thesisalsorecognizedthatthe
peakkernelefficiency doesnot necessarilylead to the bestoverall performance.Also,
efficient, flat performanceis possibleat all problemsizesthat fit in main performance,
ratherthan“jagged”performancecurvesobservedin theconventionalapproaches.
Thefirst chapterpresentedthehypothesis,motivationsandthepresumptive contribu-
tions of this study. The motivation of this thesisarosemainly from the previous obser-
60
61
vationsregardingthe performanceof differentarchitecturalcomponentsandthe overall
performanceof matrixmultiplication.
Theliteraturereview wasdonein thesecondchapter. Thebackgroundof matrixmulti-
plicationandits importancerelatedto thelinearalgebralibrarieswererefreshed.Thehigh
performanceimplementationof densematrix multiplication is challengedby the com-
plexity of the computerarchitecture,thusthe architecturalfeaturesthat arefundamental
to the deliveredperformancewere also studiedin details. At last, the secondchapter
walkedthroughthevariousrelatedresearchwork, which includetheconventionalblock-
ing algorithm,theself-tuningstrategy, thecache-obliviousalgorithm,thehierarchicaldata
structurebasedframework, andthenew variantsof theblockingalgorithm.
It wasconcludedfrom the relatedwork that a combinationof recursionstrategy and
iterationstrategy is necessaryfor betteroverall performance.Thethird chapterdescribed
sucha framework with a qualitatively analysisof theperformancetradeofs madeby dif-
ferentoptimizationmethods.This framework consistedof a hierarchicalstorageformat
that utilizes the memory-friendlyMorton Ordering,an iterative algorithm,andan opti-
mizedinnerkernelfor theAlpha21264processor.
In order to carry out an an unbiasedperformanceevaluationof the interestedim-
plementations,user-level performancemetricsweremathematicallydefinedin the fourth
chapter, in whichtheintegrationandtuningmethodologyof thisstudywasalsodescribed.
Thisstudyfinally comparedandcontrastedthehierarchicalframework basedmethodsand
TLB-blocking basedmethodsthat employ the similar inner kernel routinesin the fifth
62
chapter. A performancecomparisonwith the referencelibraries (ATLAS and CXML)
werealsoofferedontheAlpha21264platform.Theexperimentalenvironmentwasspecif-
ically constructedto obtaintrustworthyvaluesof thepredefinedperformancemetrics.The
benchmarked resultsclearly showed that the approachof this studyachievedhigherand
flatter performancethanthe self-tuninglibrary andthe vendorlibrary andvalidatedthe
hypothesis.
6.2 Future Work
This thesishasemployed a qualitative model to explain the relationshipof perfor-
mancecomponentsandtheoverallperformanceanddemonstratetheperformancetradeof
beingmadeby differentapproaches.It would be moreinterestingto explore the possi-
ble quantitativemodelingof theperformance-criticalconstraintsfor discoveringpotential
performanceimprovementsandpolyalgorithmictradeofs. Also, it remainsvaluableif the
knowledgeslearnedin thisstudycanbetransferredto theusefulpointersfor theprocessor
vendors. For example,the specificdesignstrategiesfor building the next generationof
highperformancearchitecturescanbenefitfrom suchastudyonthefundamentalarchitec-
tural constraintsandhow they affect theperformanceof critical applications.
Understandinghow thesemethodologiesimpact rectangularmatrix multiplications,
parallel matrix multiplication, and algorithmic storagetradeof (filling the matricesvs.
multiplying them)areall valid, interestingnext steps.
REFERENCES
[1] “ATLAS Errata,” http://math-atlas.sourceforge.net/errata.html#gcc3.0Date of ac-
cess:February20,2003.
[2] “ATLAS FAQ,” http://math-atlas.sourceforge.net/faq.html#authDate of access:
February20,2003.
[3] “Automatically Tuned Linear Algebra Software (ATLAS),”
http://www.netlib.org/atlas/Dateof access:February20,2003.
[4] “Compaq math library,” http://h18000.www1.hp.com/math/new/. Date of access:
June15,2003.
[5] “Earth Simulator,” http://www.es.jamstec.go.jp/esc/eng/index.html. Dateof access:
February20,2003.
[6] “GCC Homepage,” http://gcc.gnu.org/ Dateof access:February20,2003.
[7] “KazhusigeGoto’sMath Library,” http://www.cs.utexas.edu/users/flame/goto/.Date
of access:February20,2003.
[8] “PersonalCommunication,” KazushigeGoto,March,2003.
[9] “Sandpile.org, Theworld’s leadingsourcefor puretechnicalx86processorinforma-
tion,” http://www.sandpile.org/ Dateof access:February20,2003.
[10] “Top 500 SupercomputerSites,” http://www.top500.org/ Dateof access:February
20,2003.
[11] AMD Athlon Processorx86 CodeOptimizationGuide, AdvancedMicro Devices,
Inc, 2002.
[12] R. Agarwal, F. Gustavson, and M. Zubair, “Exploiting FunctionalParallelismof
POWER2 to DesignHigh-PerformanceNumericalAlgorithms,” IBM Journal of
Research andDevelopment, vol. 38,no.5, September1994,pp.563–576.
[13] E.Anderson,Z. Bai,C.Bischof,J.Demmel,J.Dongarra,J.Du Croz,A. Greenbaum,
S. Hammarling,A. McKenney, and D. Sorensen, “LAPACK: A PortableLinear
AlgebraLibrary for High-PerformanceComputers,” ProceedingsSupercomputing
’90, IEEE ComputerSocietyPress,Los Alamitos,California,1990,pp.2–11.
63
64
[14] E. Anderson,Z. Bai, J.Demmel,J.E. Dongarra,J.DuCroz,A. Greenbaum,S.Ham-
marling,A. E. McKenney, S.Ostrouchov, andD. Sorensen,LAPACK Users’ Guide,
SIAM, Philadelphia,1992.
[15] J. Bilmes, K. Asanović, C.-W. Chin, andJ. Demmel, “Optimizing Matrix Multi-
ply Using PHiPAC: A Portable,High-Performance,ANSI C codingmethodology,”
Proceedingsof theInternationalConferenceon Supercomputing, July1997.
[16] R. D. Blumofe,M. Frigo, C. Joerg, C. E. Leiserson,andK. H. Randall, “An analy-
sisof dag-consistentdistributedshared-memoryalgorithms,” In Proceedingsof the
Eighth AnnualACM Symposiumon Parallel Algorithmsand Architectures(SPAA),
Padua,Italy, June1996,pp.297–308.
[17] S. Chatterjee,V. Jain, A. Lebeck,S. Mundhra,and M. Thottethodi, “Nonlinear
Array Layoutsfor HierarchicalMemory Systems,” Proceedingsof the 13th ACM
InternationalConferenceon Supercomputing(ICS’99), June1999.
[18] S.Chatterjee,A. Lebeck,P. Patnala,andM. Thottethodi,“Recursive Array Layouts
andFastParallelMatrix Multiplication,” Proceedingsof the11thACM Symposium





[21] J.J.Dongarra,J.R.Bunch,C.B. Moler, andG.W. Stewart, LINPACK Users’ Guide,
SIAM, Philadelphia,1979.
[22] J. J. Dongarra,J. Du Croz, S. Hammarling,andI. Duff, “A Setof Level 3 Basic
Linear AlgebraSubprograms,” ACM Transactionson MathematicalSoftware, vol.
16,no.1, Mar. 1990,pp.1–17.
[23] J. J. Dongarra,J.Du Croz,S. Hammarling,andR. J.Hanson,“An ExtendedSetof
FORTRAN BasicLinearAlgebraSubprograms,” ACM Transactionson Mathemati-
cal Software, vol. 14,no.1, Mar. 1988,pp.1–17.
[24] M. Frigo, C. E. Leiserson,H. Prokop, and S. Ramachandran,“Cache-Oblivious
Algorithms,” 40th AnnualSymposiumon Foundationsof ComputerScience, New
York, NY, October.
[25] K. A. Gallivan, W. Jalby, U. Meier, and A. H. Sameh, “Impact of Hierarchical
MemorySystemson LinearAlgebraAlgorithm Design,” TheInternationalJournal
of SupercomputerApplications, vol. 2, no.1, 1988,pp.12–48.
65
[26] S. Goedecker andA. Hoisie, PerformanceOptimizationof NumericallyIntensive
Codes, Societyfor IndustrialandAppliedMathematics,Philadelphia,2001.
[27] G. GolubandC. V. Loan, Matrix Computations, Third edition, TheJohnsHopkins
UniversityPress,Baltimore,Maryland,1996.
[28] K. Goto andR. van de Geijn, On ReducingTLB Missesin Matrix Multiplication,
Tech.Rep.TR-2002-55,Departmentof ComputerScience,TheUniversityof Texas
at Austin,November2002.
[29] B. Greer, J. Harrison,G. Henry, W. Li, andP. Tang, “Scientific Computingon the
ItaniumProcessor,” SC2001, Denver, Colorado,November, IEEE andACM.
[30] J.A. Gunnels,G.M. Henry, andR.A. vandeGeijn, “A Family of High-Performance
Matrix Multiplication Algorithms,” ComputationalScience- ICCS 2001, Part I,
V. N. Alexandrov, J. J. Dongarra,B. A. Juliano,R. S. Renner, andC. K. Tan,eds.
2001,LectureNotesin ComputerScience2073,pp.51–60,Springer-Verlag.
[31] F. G. Gustavson, “New GeneralizedDataStructuresfor MatricesLeadto a Variety
of High PerformanceAlgorithms,” Proceedingsof the IFIP TC2/WG2.5Working
Conferenceon theArchitectureof ScientificSoftware, October2000.
[32] J.L. HennessyandD. A. Patterson,ComputerArchitectureA QuantitativeApproach,
Secondedition,MorganKaufmannPublishess,Inc, SanFrancisco,California,1996.
[33] G. Hinton, D. Sager, M. Upton,D. Boggs,D. Carmean,A. Kyker, andP. Roussel,
“The Microarchitectureof thePentium4 Processor,” Intel TechnologyJournal, 2001
Q1.
[34] IA-32 Intel ArchitectureSoftwareDeveloper’sManual, Intel Corporation,2002.
[35] B. L. JacobandT. N. Mudge,“A Look atSeveralMemoryManagementUnits,TLB-
Refill Machanisms,andPageTableOrganizations,” In theProceedingsof theEighth
InternationalConferenceonArchitectural Supportfor ProgrammingLanguagesand
OperatingSystems(ASPLOS-VIII), SanJose,California,1998.
[36] B. Kågstr̈om, P. Ling, and C. V. Loan, GEMM-BasedLevel 3 BLAS: High-
PerformanceModel Implementationsand Performance Evaluation Benchmark,
Tech. Rep. UMINF-95.18, Departmentof ComputingScience,Ume̊a University,
Sweden,1995.
[37] V. Kumar, A. Grama,A. Gupta,andG.Karypis, Introductionto Parallel Computing,
Secondedition, BenjaminCummings/AddisonWesley, 2001.
66
[38] M. S.Lam,E. E. Rothberg, andM. E. Wolf, “The CachePerformanceandOptimiza-
tionsof BlockedAlgorithms,” Proceedingsof theFourth InternationalConference
on Architectural Supportfor ProgrammingLanguagesandOperating Systems(AS-
PLOSIV), April 1991.
[39] C. L. Lawson,R. J.Hanson,D. R. Kincaid,andF. T. Krogh, “Basic LinearAlgebra
Subprogramsfor FortranUsage,” ACM TransactionsonMathematicalSoftware, vol.
5, no.3, Sept.1979,pp.308–323.
[40] H. Prokop, Cache-ObliviousAlgorithms, master’s thesis,Departmentof Electrical
EngineeringandComputerScience,MassachusettsInstitutionof Technology, Cam-
bridge,Massachusetts,May 1999.
[41] R. Schreiberand J. Dongarra, AutomaticBlocking of NestedLoops, Tech.Rep.
CS-90-108,Departmentof ComputerScience,Universityof Tennessee,May 1990.
[42] P. Strazdins, “TransportingDistributed BLAS to the Fujitsu AP3000and VPP-
300,” Proceedingsof the Eighth Parallel ComputingWorkshop(PCW’98), Singa-
pore,September1998,Schoolof Computing,NationalUniversityof Singapore,pp.
69–76.
[43] V. ValsalamandA. Skjellum, “A Framework for High-performanceMatrix Multi-
plicationBasedon HierarchicalAbstractions,AlgorithmsandOptimizedLow-level
Kernels,” ConcurrencyandComputation:PracticeandExperience, vol. 14,no.10,
2002,pp.805–840.
[44] S. P. VanderWiel andD. J. Lilja, “When Cachesaren’t Enough:DataPrefetching
Techniques,” IEEE Computer, vol. 30,no.7, 1998,pp.23–30.
[45] K. R. WadleighandI. L. Crawford, Software Optimizationfor High-Performance
Computing, Prenticehall, UpperSaddleRiver, New Jersey, 2000.
[46] R. C. Whaley, A. Petitet,andJ. J. Dongarra,AutomatedEmpirical Optimizationof
Software and the ATLASproject, Tech.Rep.UT-CS-00-448,Departmentof Com-
puterScience,Universityof Tennessee,Knoxville, Tennessee,September2000.
[47] D. Wise andJ. Frens, Morton-order MatricesDeserveCompilers’ Support, Tech.
Rep.533,Departmentof ComputerScience,IndianaUniversity, Bloomington,Indi-
ana47405-4101,U.S.A.,November1999.
[48] M. E. Wolf andM. S. Lam, “A DataLocality Optimizing Algorithm,” Proceed-
ingsof theACM SIGPLAN1991ConferenceonProgrammingLanguageDesignand
Implementation, June1991.
APPENDIX
GLOSSARY
ATLAS
AutomaticallyTunedLinearAlgebraSoftware
BLAS
BasicLinearAlgebraSubprograms
CISC
Complex InstructionSetComputers
CXML
CompaqeXtendedMath library
DGEMM
DoubleprecisionGEneralMatrix Multiplication
DLP
DataLevel Parallelism
DoF
Degreeof Flatness
FLOPS
FLoating-pointOperationsPerSecond
67
68
GCC
GNU CompilerCollection
GEMM
GEneralMatrix Multiplication
HERO
HierarchicalExtensionof RecursiveOrdering
HDS
HierarchicalDataStructure
ILP
InstructionLevel Parallelism
LAPACK
LinearAlgebraPACKage
MFLOPS
Millions FLoating-pointOperationsPerSecond
OoO
Out-of-Orderexecution
OS
OperatingSystem
RISC
ReduceInstructionSetComputers
SIMD
69
SingleInstructionMultiple data
SSE
StreamingSIMD Extensions
TLB
TranslationLookasideBuffer
VLIW
VeryLong InstructionWord
