Simulation of Lattice QCD is a challenging computational problem. Current technological trends in computation show multiple divergent models: homogeneous multicore architectures and accelerators, on-chip or off-chip, alongside the traditional architectural models. In the face of this technological abundance, assessing the performance trade-offs of computing nodes based on these technologies is of crucial importance to many scientific computing applications. In this study, we assess the efficiency and performance that can be expected for the Lattice QCD problem on representative architectures, and we project the expected improvements of these architectures and their impact on Lattice QCD performance. We additionally try to pinpoint the factors that limit performance on these architectures.
1 Introduction

Quantum Chromodynamics (QCD) is the theory of the strong interaction in the domain of subnuclear physics. Lattice QCD (LQCD) is a numerical method based on QCD first principles, and the only one able to compute reliably many quantities of high scientific relevance. It is based on a discretization of space-time and on a Monte-Carlo method. Because the system is extremely complex and the number of degrees of freedom is of the order of a billion today, a number bound to increase in the future, LQCD needs heavy, efficient and sufficiently cheap computing tools (hardware and software). The goal of the calculation is to produce, according to a given probability law resulting from the theory, a wide statistical sample of gauge configurations, each of which is a large file of complex numbers. Even with the efficient algorithms described in the next section, this requires a very large amount of computing power.

Simulation of this theory is one of the grand challenge problems, in part because only a small fraction of the available computing capacity is usually exploited on most computing architectures. Assessing the performance and efficiency on new architectures, and with different algorithmic representations of the problem, is important to get closer to the computational power it needs. This computation tends to have low utilization and efficiency on most general-purpose computing facilities, leading to inefficient power consumption and unrealistic demands on the number of computational nodes required.

So far, building a special machine for simulating the Lattice QCD problem has been a widely used approach [1, 2, 3]. The motivation to build specialized computing facilities, despite all the associated overheads, is the enormous computational power needed, in addition to the special characteristics of the Lattice QCD computation. For this problem, an optimal computing node should offer support for complex arithmetic instructions, a large register file, SIMD instructions, software-controlled cache management, and memory and communication bandwidths balanced with the computational power. Multiple successful balanced designs were historically built for Lattice QCD [1, 2, 3], but the overhead of design and maintenance is usually high. Building
out of commodity components that best fit the problem characteristics is a very attractive alternative, but it requires a careful analysis of the problem, together with an analysis of the large spectrum of architectural alternatives.

Currently, the driving forces in computer architecture push multiple technologies with no clear convergence to an overall performance/power winner. We intend to explore these technologies, guided by the computational requirements of Lattice QCD. As it is extremely difficult to experiment exhaustively with all the emerging technologies, we choose to focus on two groups of technologies:
• General-purpose homogeneous nodes: we describe our efforts to assess the trade-offs of implementing the Lattice QCD computation on different architectures. We mainly focused on the Itanium and Pentium architectures. These two architectures represent two major design alternatives in general-purpose computing. The EPIC architecture of the Itanium processor relies on the compiler to manage instruction parallelism, while the superscalar architecture of the Pentium processor relies on hardware management of instruction parallelism. Both architectures have been deployed successfully to build highly scalable machines.
• Heterogeneous computing nodes: we investigate the use of specialized accelerators to improve the performance of a computing node. The first alternative we explored is an Intel Xeon processor assisted by an NVIDIA G80 graphics card as an accelerator, a heterogeneous multi-chip alternative. The second alternative is based on integrating accelerators with the main processor on chip, providing a heterogeneous system-on-chip kind of architecture. A good representative of this architecture is the IBM Cell Broadband Engine.

The objective of this study is to compare the prospects of future technologies for the simulation of Lattice QCD. We do not seek a generalized comparative study of all future architectural trends; we target a comparison based on the requirements of the Lattice QCD computations. Our study reveals that the performance of the Lattice QCD computation can be greatly improved using specialized accelerators. More importantly, we predict that the imbalance between computational power and communication bandwidth will remain an obstacle for Lattice QCD on all the studied architectures. Efficient usage of the computational power will rely heavily on the level of explicit resource management that a particular hardware offers.

The rest of this paper is organized as follows: Section 2 introduces the Lattice QCD problem and its physical formulation. Section 3 analyses the performance of a single node based on various architectural alternatives. Section 4 discusses the improvements needed in the performance of the discussed architectures and contrasts them with their expected or planned evolution. Section 5 discusses the performance impact of the communication architecture for a large-scale system. Section 6 concludes this work.

2 The Problem of LQCD

In Lattice QCD, the four-dimensional space-time continuum is simulated by a four-dimensional lattice, of lengths X, Y, Z, T respectively in the four directions, with quark quantum fields on each lattice site and gluon quantum fields represented by SU(3) matrices on each link between these sites. SU(3) refers to 3 × 3 unitary matrices of complex numbers with unit determinant. The 3-dimensional space in which these matrices act is referred to as the space of the three quark "colours". The spinors are represented by four SU(3) vectors, each composed of three complex variables.

The calculation aims at computing the average values of physical quantities, which are functions of these fields, according to a probability distribution that also depends on the fields and is derived by a discretization procedure from the basic QCD Lagrangian. This average is taken over the full space of all the possible values of the fields, known as the field configuration space. The integration of the quark fields is done formally, leading to a complicated non-local probability distribution: the determinant of the very large Dirac operator matrix, which depends only on the gluon fields. The probability law is thus described by a complicated expression depending only on the gluon fields, i.e. the SU(3) matrices. We call a set of SU(3) matrices defined on all links a gauge configuration, so the probability law is defined in the space of gauge configurations. For large lattices the space of gauge configurations is a manifold with dimensionality of the order of billions. Only a Monte-Carlo method allows such a huge calculation.

To estimate the average values of the physical quantities we need representative samples of gauge configurations (say about 5000) for every set of parameters, generated according to the above-mentioned probability law. The Hybrid Monte-Carlo (HMC) algorithm [4], or variants of it such as the Polynomial HMC (PHMC) and the Rational HMC (RHMC), is used to generate these samples. This is a very heavy calculation. In the following discussion, we will consider an HMC implementation by the ETMC collaboration [5, 6]. The run is decomposed into trajectories, which are indeed trajectories of a complex dynamical system depending on the SU(3) matrices. Each trajectory
leads from one gauge configuration to the next one of our sample. The Hamiltonian of this system is devised in such a way as to generate gauge configurations with large enough probability. At the end of the trajectory a Metropolis test ensures the correct probability law. The trajectory is divided into steps. The gauge configuration stays unchanged during a step and is updated after every step.

The algorithm manipulates objects named Wilson spinors. One Wilson spinor is composed of a spinor (a 12-dimensional complex vector) on every lattice site. During a step, whichever variant of the algorithm is used, a large Wilson spinor is repeatedly multiplied by the large matrix named the Dirac operator, yielding an output Wilson spinor. This part is linear algebra and it is the most time-consuming part of the algorithm. The multiplication of the Wilson spinor by the Dirac operator is mainly performed in the ETMC code by a routine named Hopping_Matrix, which contributes about 90% of the total execution time [7].

Wilson spinors as well as gauge configurations are very large arrays. As we shall see, the major problem in producing efficient computations is to ensure fast enough data transfer to and from the computing units. It is worth noticing that the stability of the gauge configuration, during very many iterations of the multiplication of Wilson spinors by the Dirac operator, allows a significant reduction of the data to be transferred if one manages to keep the SU(3) matrices in some kind of fast access memory close to the computing units. This is not easy in general because the gauge matrices constitute rather large arrays.

The multiplication of the Wilson spinor by the Dirac operator is expressed in formula (1): the action of the Dirac operator involves a sum over quark spinors (ψ(i)) multiplied by the gluon field (U_µ(i)) through the spin projector.
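A common way of writing this product, consistent with the symbols defined around it, is sketched below. It is an assumption rather than the exact convention of the code under study: the output spinor name $\chi$, the complex conjugate $\kappa_\mu^{*}$ on the backward term, and the signs in front of $\gamma_\mu$ differ between references. Here $i \pm \hat\mu$ denotes the neighbouring site of $i$ in direction $\mu$, and $(1 \mp \gamma_\mu)$ plays the role of the spin projector.

```latex
\chi(i) \;=\; \sum_{\mu \in \{x,y,z,t\}} \Big[
    \kappa_\mu \, U_\mu(i) \, (1 - \gamma_\mu) \, \psi(i + \hat\mu)
  + \kappa_\mu^{*} \, U_\mu^{\dagger}(i - \hat\mu) \, (1 + \gamma_\mu) \, \psi(i - \hat\mu)
\Big]
```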
where κ_x = κ_y = κ_z = 1 and κ_t = exp(iπt/T), which expresses anti-periodic boundary conditions in the time direction. The gluon field SU(3) matrices are labelled by their starting site and the space-time direction of the link on which they are defined.

The code we consider contains two variants of the algorithm. The first one, named full-spinor, corresponds to the direct application of equation (1), while the second, named half-spinor, proceeds via two phases which can be expressed by the following set of equations:

First phase (u series):
The pros and cons of both variants depend on the architecture and will be discussed later. In the next sections we will present studies on numerous architectures and lattice sizes. Our general goal is the Petaflop, as justified in Section 5. Working on one node, we have in mind different sublattices according to different possible decompositions of the full lattice. We also varied the lattice size in order to highlight the role of the different architectural components (e.g., cache size).

Figure 1: Computation schemes of the Hopping_Matrix routine based on the full-spinor and the half-spinor versions.

3 The Performance of a Single Computing Node

The performance of Lattice QCD on a multiprocessor machine relies heavily on the performance of the individual computing nodes. In this section, we start by outlining the architecture-independent attributes of the Hopping_Matrix routine that will interact with the architectures under investigation. We then discuss the performance on these individual nodes in separate sections.

In the Hopping_Matrix routine, the computation of a spinor involves 1608 floating-point operations per lattice site, touching 360 floating-point variables. We focus on the two implementations already mentioned, the full-spinor and the half-spinor ones. The full-spinor version pulls all the data needed to compute an output spinor from all surrounding sites. These data include the gauge field links and the spinors. In each call of the Hopping_Matrix routine, the gauge links show non-redundant, regular access, while reads of the surrounding spinors usually carry redundancy and irregularity of access, because each input spinor appears in the computation of eight different output spinors. To solve this data access problem, the half-spinor version carries out the computation in two phases.

Figure 2: Hopping_Matrix is split into two phases, useries (phase one), then vseries (phase two), with almost balanced code, data and CPU time.

In the first phase, each input spinor is visited once and the computations related to all the surrounding spinors are pushed to the surrounding spinors in intermediate half-spinor structures. Writing of the output
half-spinors is aligned to optimize the access pattern in the second phase. In the second phase, the results of the first phase are used to compute the output spinors. Access to the intermediate half-spinor structure is more regular. The advantage of the half-spinor version is that the irregular access pattern is associated with the writes of the first phase and not with data reads. On most general-purpose architectures, memory reads are more critical to performance. On the other hand, the accessed data increase by about 7% for the half-spinor version compared with the full-spinor version.

Figure 1 shows the four code variants explored in this study. In the half-spinor version, the computation can be decomposed into two phases. Additionally, the computation can be further decomposed based on the number of space directions. Figure 2 depicts the two main phases of the half-spinor computation, which allow a friendlier cache behavior on processors with a cache hierarchy. For processors with normal caches, e.g. Itanium and Pentium, we will focus on the half-spinor version because of its performance advantage; for computing nodes using accelerators we will explore both techniques.

The most dominant attribute of the Hopping_Matrix computation that affects performance is the low computation to memory access ratio, as shown in Figure 3. This analysis assumes no temporal locality between calls to the Hopping_Matrix routine. This assumption is valid given the large footprint associated with a reasonable lattice size, and given that the computation alternates between multiple input data on consecutive calls to the routine. The computation to memory access ratio does not exceed 1.05 double precision floating-point operations per byte, and is usually as low as 0.56 F/Byte if the lattice data is not cached. In contrast, for dense matrix-matrix multiplication the reuse of data grows with the data size, though it is only partly exploited, using blocking, because of limited cache sizes. Caching the lattice data is difficult to achieve because we tend to choose a bigger lattice to mitigate the cost of communication between nodes. The Lattice QCD problem requires dividing the lattice among many cooperating nodes, which
need to communicate results. To overcome the disparity between the communication bandwidth and the computational power of the processing node, we tend to increase the sublattice size per node. The computation grows linearly with the volume while the communication grows linearly with the surface. Figure 3 shows that a linear improvement (increase) in the computation to communication ratio requires a much faster growth in the required memory space (as the fourth power of the linear extent for a hypercubic four-dimensional sublattice).

Another attribute worth noting is that the gauge field constitutes about 75% of the data accessed (static memory footprint), compared with 12.5% for the input spinors. At runtime, for the full-spinor version, the input spinor vector represents 55% of the dynamically accessed data and has the least regular access pattern, because of the spin-projection operator. For the half-spinor version, most of the data accessed at runtime belongs to the intermediate data structures carrying the half-spinor data.

Figure 3: Summary of the attributes of the Hopping_Matrix routine in terms of memory requirements, density of computation to access, and access pattern.

In the following subsections, we present our study first on the use of homogeneous computing nodes based on Pentium and Itanium processors, then on the use of heterogeneous nodes where a general-purpose processor is assisted by a special accelerator to speed up floating-point computations. We specifically present our study on the use of an Nvidia GPU assisting an Intel Xeon processor and on the use of the IBM Cell BE.

3.1 Baseline Code - Pentium4

In order to have a basis for comparison, the performance of the HMC/ETMC code is measured on an Intel Xeon Prescott processor at 3.2 GHz with 16 KB L1 and 1 MB L2 (using one core). The HMC/ETMC code comes with a version optimized for SSE SIMD instructions. These are vector instructions able to perform 2 double precision operations in a single instruction issue. The use of the Pentium SSE SIMD instructions is explicit in the code, through dedicated intrinsics. It basically addresses the computations on the spinor vectors (complex vectors of size 3). Performance has been measured on two input data sets. The first one is a lattice of size 4^4. The second one is a lattice of
size 16^3 × 32. Performance results are described in Figure 4 for two versions, with and without SSE. Based on the number of floating-point operations at each lattice site (1608 operations per site, see Section 3), the clock cycle counter and the frequency of the Pentium (3.2 GHz), we found that the speed of the original code is about 2.3 GFlops for the 4^4 lattice and 1.5 GFlops for the 16^3 × 32 lattice when using the SSE instructions. This means that the original Pentium version of the code is already highly optimized (peak performance is 6.4 GFlops in double precision). The better performance for the smaller lattice is probably due to data reuse that cannot be exploited so well with the large size.

3.2 Itanium Architecture

We evaluate the performance of the original HMC code on an Intel Itanium Montvale processor at 1.67 GHz, with 256 KB L2 and 12 MB L3 (on a single core). In the original version, Hopping_Matrix runs one loop for each of the two half-spinor computations: one loop for all directions on the odd sites, then one loop for all directions on the even sites (Figure 1.B). The original code suffers from two main deficiencies on the Itanium architecture:
• Analysis of the compiler-generated assembly code [8] shows that the compiler has difficulty optimizing the whole basic block of the loop. Too many instructions and too high a register pressure prevent the compiler, for instance, from software pipelining the loop. This has a high impact on the Itanium architecture.

Figure 5: Performance in GFlops of the original Hopping_Matrix and the tiled version on a 1.6 GHz Montvale Itanium2 for different lattice sizes.
• While some data (in particular the gauge fields) are reused through the computation, the size of the volume prevents data from staying in cache between two uses.

Note that there is no SIMD code for Itanium; unlike for Pentium, the code considered is plain C code. This leads to two transformations: each loop of the two phases is tiled so that the data within a tile stays in cache, and the tiled loop is split over the directions of computation in order to enhance the quality of the compiled code. Figure 1.D shows the structure of the resulting code: each of the two parallel loops is tiled, and the half-computations corresponding to each direction within each of these loops are executed sequentially.

Figure 5, on the left, shows the performance of the two versions w.r.t. lattice size. For a small lattice of size 4^4, all the data fit in the L2 cache, tiling only introduces overhead, and performance reaches 3.2 GFlops (peak performance is 6.4 GFlops in double precision). For a medium-size lattice of size 8^3 × 16, the data is still in L3 but no longer in L2. There is a slight performance improvement with the tiled version, which nearly reaches 3 GFlops. Finally, when the lattice is too large even for the L3 cache, the tiled version outperforms the original code by 60%. Tiling reduces the impact of L3 cache misses, but as the reuse factor is low (between 2 and 3) and does not grow with the lattice volume, memory accesses still drive the performance for large enough lattices. The overall performance gain for the whole HMC code reaches a 40% speed-up for a 16^3 × 32 lattice. The best tile size for the architecture has 128 iterations and corresponds to a maximum usage of the cache hierarchy.

3.3 Computing Node Based on CPU Assisted by a GPU Accelerator

The use of GPUs for scientific computing is currently under investigation by many research groups and gives very promising results for many scientific applications [9]. We investigated the use of an NVIDIA G80 GTX graphics card (GPU board) hosted on a server with a quad-processor Intel Xeon at 1.86 GHz and 4 MB of L2 cache. The NVIDIA G80 is composed of 16 multiprocessors that are interconnected to banked DRAMs through an impressive bandwidth of 78.6 GB/s. Figure 6 depicts the
layout of a GPU accelerator connected to a CPU through a PCI Express bus.

Figure 6: Computing node based on a CPU assisted by a GPU accelerator.

The parallelization process for GPUs is traditionally based on a vendor compiler. Expressing a problem is facilitated by the advent of general-purpose programming technology, such as CUDA [10] by NVIDIA [11]. Even though parallelization is done through the compiler, the programmer carries the responsibility of transforming the code in a way that enables efficient parallelism. A must-do transformation is to remove control-flow instructions whenever possible. For control-flow variables with a limited number of outcomes, lookup tables can be used, or redundant computation in conjunction with masking, to obtain control-flow-free code. These techniques can prove effective in assisting the generation of SIMD operations.

In our implementation, we explored the effect of work granularity on performance. The granularity of a thread affects the resources allocated to it, in particular the number of registers (8192 for the NVIDIA 8800 GTX). Apparently, the Cuda compiler tries not to reduce the amount of parallelism below 64 threads, assigning at most 128 physical registers per thread. Among the alternatives explored, we tried the half-spinor version and the full-spinor version. For both versions, either each thread carries the responsibility for the whole computation of an output spinor (coarse-grained implementation), or the computation is divided among 16 threads, based on the number of dimensions of the space. Each thread iterates through multiple sites of the output spinor array. The coarser the thread computation, the greater the stress on the resources, because more resources are needed to reduce the pressure on memory. Given that the memory access latency on the GPU is in the range of 400-600 cycles, it is necessary to reduce the memory access frequency, especially since caching within the GPU is severely limited in size. For the Hopping_Matrix computation, we noticed that the finer the granularity of the work assigned to a thread, the better the performance achieved. Figure 7 shows the effect of the granularity choice on performance. The number of threads per multiprocessor is set
to 64 (a higher number fails to launch for coarse-grained tasks because of the excessive resources required).

We experimented with the two versions of the computation, half-spinor and full-spinor, discussed in Section 3. For the coarse-grained versions, even with the increased memory pressure, the half-spinor version (two threads per spinor computation) provides better performance than the full-spinor version (one thread per spinor computation). When the spinor computation is split among 16 threads (fine granularity) for both the half-spinor and the full-spinor techniques, the full-spinor version becomes better than the half-spinor version because it accesses the memory less frequently.

The bandwidth from the host CPU memory to the GPU memory has a critical impact on performance. Although the gauge field is constant across iterations (and need not be exchanged), the input and output spinors constitute 25% of the data accessed in the computation. Moving these spinor data back and forth between the GPU and the host processor has a significant cost. The low density of computation compared with the data exchanged causes the communication overhead between the GPU and the CPU to amount to 40% of the total execution time, even for a large lattice of size 32^3 × 32. To improve the ratio of floating-point operations to data exchanged, we allocated some of the arrays that hold the intermediate spinor computation in the GPU memory. Using this technique, we reduced the data exchange frequency to one fourth, and the contribution of data communication to the total execution time is lowered to 11%. This technique requires the computed spinors to be communicated less frequently in a multi-node implementation. Figure 8 shows that increasing the sublattice size improves the performance up to a certain extent. Using intermediate spinor arrays to do multiple steps of Hopping_Matrix reduces the spinor memory exchange overhead between the CPU memory and the GPU memory.

Figure 8: Performance scaling (in single precision) of the computation for multiple sublattice sizes.

The performance achieved is about 6.2 Gflops in single precision. The efficiency of the Lattice QCD computation on GPUs is in the range of 3-4% of the peak
performance because of the low reuse of data and the complexity of the data access pattern, which increases conflicting accesses to the GPU memory banks.

3.4 Computing Node Based on the Cell BE

On the Cell BE, we again explored both the half-spinor and the full-spinor implementations of the Wilson-Dirac operator. Two data layouts are considered: small lattices that fit in the local store, and large lattices that are stored in the main memory and DMAed to the local store in chunks.

The Cell Broadband Engine (BE) is a unique architecture in that it integrates specialized accelerator processors, called synergistic processing elements (SPEs), with the main PowerPC-based processor. Each SPE has a limited memory, called the local store, a large register file of 128 16-byte registers, and a specialized SIMD processing element. The chip integrates an XDR memory controller in addition to a FlexIO controller. This integration leads to a memory bandwidth of up to 25.6 GB/s. Figure 9 outlines the main components of the Cell BE.

The Cell BE is known for being difficult to program, partly because of the detailed control it gives the programmer over memory management across the different address spaces of the SPEs and the main memory. Special DMA calls are usually required to control data transfers. Transforming code to perform efficiently in SIMD mode is an additional, traditional obstacle to exploiting the SPE processors. Relying on the compiler is an option that is yet to mature for this kind of architecture.

The limited local store size and its separate address space add another dimension to the complexity of accessing the data. As the data assigned to a computing node will not fit in the local store, a subset of the data needs to be brought to the local store for processing. Two options exist for bringing in data. The first divides the sublattice assigned to the Cell BE into further, smaller sublattices, with possible data exchanges between SPEs. The second option divides the computation into frames of data, where SPEs do not need to communicate data. The first alternative, which will be presented with the half-spinor implementation in this work, has the potential of reducing the pressure on the memory bandwidth. On the other hand, it requires frequent communication between the SPEs and
more synchronization points. The second alternative, which will be presented with the full-spinor implementation, requires less synchronization and data exchange between SPEs, but may lose some opportunities to reuse data accessed by the neighboring SPEs. The little reuse of data in the Lattice QCD computation encourages making this trade-off.

Figure 9: Computing node based on an on-chip accelerator, represented by the IBM Cell BE.

3.4.1 Implementation based on Half-spinor version

All SU(3) objects (vectors, matrices) have been transformed into 4-way complex vectors and 4x4 complex matrices to allow for an easy SIMDization using intrinsics. This is costly in additional flops (2656 instead of 1608) but allows, beyond direct measurements, a simple grasp of different scenarios depending on the size of the local store. It is assumed that 4K sites are allocated on each SPU and 128K sites on each CELL. Double buffering is always used across these scenarios. They are:
1: very small (current) local store size: all of the Wilson spinor, gauge matrix and half-spinors have to be moved in or out to/from main memory for each site between both phases;
2: the half-spinor is kept in the local store memory (or very close to it);
3: the gauge matrix is kept in the LS or around it;
4: both the half-spinor and the gauge matrix can be allocated in the local store (the 'Golden Cell').

The outcome is that scenario 1 demands a bandwidth well above the available local store to main memory bandwidth (3.2 GB/sec/spu), leading to degraded performance (1.8 GFlop/sec/spu instead of 2.4 for the other scenarios). Scenario 3 is very interesting, even if it requires an extra effort in increasing the local store size: it is worth recalling that the gauge matrices remain constant over many calls of the Hopping_Matrix routine.

3.4.2 Implementation based on the full-spinor version

SIMDizing the code requires aligning the data in a way that can be accessed with the least number of data shuffles. Each spinor is accessed in eight different contexts (due to the spin projection operator in Equation (1)), depending on the space direction. Each access involves different operations and a different memory access pattern for the real and imaginary parts of every complex
variable. Unfortunately, SPEs do not support a complex arithmetic instruction set. Dynamic memory accesses to the input spinors constitute 55% of the data accessed, as shown in Figure 3, while the input spinors represent only 12.5% of the static data accessed. Unfortunately, we cannot fuse these data statically, because the same spinor is accessed in eight contexts with different surrounding spinors in each case. To overcome this difficulty, we devise a technique, called runtime fusion, that fuses the input data used for the computation of multiple consecutive spinors. The real parts of these input data are fused into a single register, and similarly for the imaginary parts. For instance, a 16-byte register requires fusing the computation of two double precision output spinors or four single precision output spinors. Figure 11 shows the layout of the runtime data fusion technique.

Figure 10: Site data flows inside Hopping_Matrix within the 2 phases. The input (useries) and output (vseries) spinor indexes are different (different sites), hence the relevant gauge matrices (site dependent) are also different.

Figure 11: Runtime data fusion technique for the full-spinor version on the Cell BE.

Figure 12: Performance vs. number of SPEs used for Hopping_Matrix in single and double precision.

Runtime fusion merges the computation of an unrolled loop, thus grouping the data with a similar access pattern into 16-byte words. The result of the computation is then scattered back into multiple spinor results. The Cell BE allows such a technique because of its large register file. Almost 6 KB of data are touched during the computation of a group of two output spinors in double precision.

This technique leads to performing the Hopping_Matrix routine at 80 Gflops in single precision and 8.7 Gflops in double precision. Double precision is not optimized in the current generation of the Cell BE, but the PowerXCell 8i with eDP carries an optimized double precision engine that is capable of 50 Gflops for the Hopping_Matrix routine.

A realistic lattice needs to be stored in the main memory and retrieved in pieces for processing. Sustaining the single precision computational power of Hopping_Matrix would require 48
GB/s of memory bandwidth, far beyond the 25.6 GB/s available. The input spinor is redundantly accessed 8 times during the computation of the output spinor array. Bandwidth can be saved if only non-redundant data are brought to the local store from the external memory, and the redundant part is then constructed inside the SPE local store. The saving in bandwidth can be 25% for a sublattice of size 16^3 × 16. We exploited the above attributes of the spinor access pattern to achieve a Hopping_Matrix performance of 31.2 Gflops in single precision and 8.6 Gflops in double precision. Figure 12 shows the performance achieved while changing the number of SPEs used. For single precision computation, four SPEs are able to deliver the maximum the chip can afford. In fact, the performance drops by 5% if all the SPEs are used. The bandwidth demand of these 4 SPEs amounts to 23.5 GB/s, i.e., it almost saturates the bandwidth to the external memory. The same behavior is expected for double precision with the new PowerXCell 8i with enhanced double precision.

4 Anticipated Future Evolutions and Comparisons

In this section, we discuss the expected future performance evolution of the studied architectures, both for homogeneous general-purpose cores and for accelerator-based nodes. Our study shows that the use of accelerators can greatly help to boost the computational performance of the main kernel routine for Lattice QCD. We also discuss the most important criteria that will influence the choice between the studied accelerator architectures for Lattice QCD.
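The projections discussed in the rest of this section follow from the bandwidth-bound nature of Hopping_Matrix established in Section 3. As a rough illustration, the Python sketch below recomputes three of the quoted figures: the uncached computation to memory access ratio, the performance left when the bandwidth a kernel would need is scaled down to the bandwidth available, and the volume to surface ratio of a sublattice. The function names and the simple models (performance proportional to bandwidth, a hypercubic sublattice of side L exchanging its eight boundary faces) are assumptions made for this example, not part of the measured code.

```python
"""Back-of-the-envelope checks for the bandwidth-bound Hopping_Matrix kernel.

All constants are figures quoted in the text; the models are simplifying
assumptions introduced for illustration only.
"""

FLOPS_PER_SITE = 1608      # floating-point operations per lattice site
VARIABLES_PER_SITE = 360   # floating-point variables touched per site


def flops_per_byte(bytes_per_variable: int) -> float:
    """Computation to memory access ratio when no lattice data is cached."""
    return FLOPS_PER_SITE / (VARIABLES_PER_SITE * bytes_per_variable)


def bandwidth_limited_gflops(available_gbs: float, needed_gbs: float,
                             target_gflops: float) -> float:
    """Scale a target rate by the fraction of the needed bandwidth available.

    Assumes (hypothetically) that performance is proportional to memory
    bandwidth until the target rate is reached.
    """
    return target_gflops * min(1.0, available_gbs / needed_gbs)


def volume_to_surface_ratio(side: int) -> float:
    """Sites per exchanged boundary site for a hypercubic 4-D sublattice.

    The volume is side**4; the boundary is made of 8 faces of side**3 sites.
    """
    return side ** 4 / (8 * side ** 3)


if __name__ == "__main__":
    # Double precision, uncached: the text quotes about 0.56 flop/byte.
    print(f"{flops_per_byte(8):.2f} flop/byte")
    # Cell BE, single precision: 80 Gflops would need 48 GB/s, 25.6 GB/s exist.
    print(f"{bandwidth_limited_gflops(25.6, 48.0, 80.0):.1f} Gflops")
    # Doubling the side doubles the ratio but needs 2**4 = 16 times the memory.
    print(volume_to_surface_ratio(8), volume_to_surface_ratio(16))
```

Under these assumptions the single precision estimate is about 43 Gflops, the same order of magnitude as the 31.2 Gflops measured in Section 3.4.2; the model ignores every overhead other than memory bandwidth.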
Expected advances for Pentium/Itanium

By the end of the year 2008, the next generation of the Itanium architecture, Tukwila, and of Xeon processors is expected to integrate a new memory controller, named Common System Interface. This controller will offer fast point-to-point processor communication and will have a peak inter-processor bandwidth of up to 96 GB/s and a peak memory bandwidth of 34 GB/s (the first processors are expected to have a bandwidth of only around 24 GB/s). This would then be comparable to the current memory bandwidth of the Cell BE and would improve performance for out-of-cache lattices.

The best performance for Hopping_Matrix is obtained when the whole lattice fits in cache (L2 and L3). For the Montvale processor, this corresponds to lattices up to a size of 8^3 × 16. Tukwila is planned to have a 30 MB shared L3 for 4 cores. Without any change in the micro-architecture, a sustained 3 GFlops/core would then be obtained for lattices of 8^3 × 32. Any future increase in cache sizes would help to maintain a high level of performance. Experimental results for a whole multi-core processor, taking advantage of multi-core interactions, are still to be obtained.

Efficiency for the Pentium/Itanium code on one core is as high as 50% for smaller lattices (considering only Hopping_Matrix), but for the whole code it is 18%. By comparison, the efficiency on a multicore node of BlueGene/P is 16%, i.e. a sustained performance of 2.2 Gflops/node (4 cores/node).

Future prospects of the Cell BE

Double precision computation is improved in the new generation Cell eDP engine (PowerXCell 8i). Simulation experiments show that the Cell eDP is expected to deliver 16 Gflops of double precision computation. Three to four SPEs will also be able to saturate the bandwidth in double precision, because no improvement to the memory bandwidth is introduced. An increase in the local store size can reduce the pressure on the bandwidth by improving the reuse of the data brought to the SPE. The unused SPEs can be turned off, thus saving power. The performance of Lattice QCD codes on the Cell BE would improve if the memory bandwidth were improved in future generations of the Cell. The kernel routine
implementation can saturate up to double the bandwidth for single precision computation on the current generation Cell BE. For the double precision implementation on the Cell EDP, the Hopping_Matrix routine can saturate more than triple the current memory bandwidth (89 GB/s are needed to observe 50 Gflops of double precision computation on the Cell EDP). If these bandwidths are achieved at some point, then complex arithmetic instructions would be needed to gain further performance.

Expected advances on the GPU

So far, most GPUs lack efficient support for double precision computation. This is to be rectified in the near future. Floating-point exception handling is also not supported. The bandwidth of data exchange between the GPU and the memory is on the verge of doubling. Because performance depends on this scarce bandwidth, we do not expect that connecting multiple GPUs to the same northbridge will be an effective solution. The efficiency of Lattice QCD computation on GPUs is in the range of 3-4% of the peak performance, because of the low reuse of data and the complexity of the data access pattern, which increases conflicting accesses to the GPU memory banks. These issues may require further investigation of better data alignment.

The reliability of the results obtained on GPUs is a major concern. GPUs have historically served graphics applications that require high performance but can tolerate some errors at runtime. For scientific computing, this unreliability is difficult to rectify at the algorithmic level, and software solutions to unreliability usually result in a loss of performance.

Cell vs. GPU performance comparison

Among the factors dominating the performance that any computing node can achieve for Lattice QCD are the bandwidth to the memory system and the programmability of the computing node. The best bandwidth observed is currently associated with integrating the memory controller on the die with the microprocessor. The low computation to memory access ratio makes the performance heavily reliant on the memory bandwidth, especially for microprocessor cores with a SIMD instruction set. The Cell BE currently has the highest bandwidth to memory, which is why it delivers promising performance numbers. The GPU performance is bounded by the low ratio of computation to data transfer: a large volume of communicated data has to pass through the limited bandwidth between the host memory and the GPU memory.

Another challenge is that the irregular pattern of accessing spinors cannot be handled efficiently when the job of SIMDization is handed to the compiler. The compute kernel performance for Lattice QCD usually relies on hand-coded optimizations to get the most out of the target architecture. Expressing the problem in a way that allows efficient compiler SIMDization requires more study.

The low efficiency of computation on GPUs makes the Watt/Gflops ratio as high as 28. On the PowerXCell 8i, Lattice QCD requires 3 Watt/Gflops for single precision and about 6 Watt/Gflops for double precision, assuming none of the SPEs is turned off.

Expected performance evolution and the Lattice QCD problem

The performance of a single node can increase in future generation architectures because of higher integration on a single chip. For instance, the future generation Cell BE is expected to have more SPEs per chip, GPUs more multiprocessors, and multi-core systems more cores per chip. Our study leads us to believe that the efficiency of utilizing the computational resources on any of these future architectures will continue to be sub-optimal. Lattice QCD reuses data less than the average applications for which most manufacturers balance their designs. A computing facility built from commodity components can still be used for Lattice QCD, provided that enough resource management
is explicitly allowed. Explicit management allows resources to be used according to the balance needed for Lattice QCD, for instance by switching off computing resources left unused because of the memory bandwidth bottleneck. Balancing the computing node, the bandwidth to memory, and the communication between nodes can then be achieved through resource management rather than special system designs. Currently, our study shows that we cannot achieve less than 3-6 Watt/Gflops for Lattice QCD, meaning multiple megawatts for a Petaflops-capable machine. The performance needed for Lattice QCD requires general technological improvements in performance and power consumption, as well as support for micro-tuning to increase the efficiency of handling the specifics of the Lattice QCD computation.

5 Multi-node Systems

The goal of lattice QCD in the coming years is to compute real QCD, i.e. with light quarks possessing the mass they have in nature. This typically means a pion twice lighter than in usual present computations, which implies a length twice larger in physical units. To increase the accuracy of the continuum limit and to allow calculations with heavy quarks, a reduction of the lattice spacing by a typical factor of 2 will be welcome. This leads to a multiplication by 4 of the lengths in lattice units, i.e. a scaling factor of $4^4 = 256$ in volume. Starting from a lattice of $32^3 \times 64$, we end up with $128^3 \times 256$. This is of course only a rough estimate. We need to gain more than two orders of magnitude, which indeed amounts to Petaflops sustained performance. The present state of the art, on BlueGene/P with the baseline code studied in this paper, reaches about 2.2 Gflops per quad-core node, i.e. 22 Teraflops for ten racks (10000 nodes).

5.1 The Effect of Communication Architecture on Performance

Simulating Lattice QCD at a physically meaningful size requires a large number of computing nodes. For instance, simulating a $128^3 \times 256$ lattice requires 8192 nodes, each solving a $16^3 \times 16$ sublattice. The communication between the 8192 nodes is of critical importance to the performance, especially when the computing node performance is improved significantly. Many machines built for Lattice QCD used a 3D torus network to connect the computing nodes [1, 3]. A current project for a QCD
specialized machine, QPACE [12], continues to adopt this network topology, with computing nodes based on the PowerXCell 8i.

The communication of Hopping_Matrix follows a nearest-neighbor pattern. Given the large volume of contiguous data communicated, the communication latency of this pattern is determined mostly by the link bandwidth. We assume a simple model for the communication latency:

$$\text{communication latency} = \text{setup time} + \frac{\text{data size}}{\text{bandwidth}}$$

With this model, the communication latency is easily computed and compared with the computation time. The latency is dominated by the bandwidth term: even a large setup time of 1 µs contributes less than 1% of the communication latency.

In Figure 13, we present the communication time as a percentage of the computation time for the Hopping_Matrix routine. We computed it for several performance estimates of the computing nodes, ranging from 1 Gflops to 16 Gflops. For simplicity, we assumed that the computation power does not vary greatly over the set of sublattice volumes considered (carrying 8K to 1M spinors). The sustained link bandwidth is 250 MB/s per link, which is about the expected sustained bandwidth of the BlueGene/P interconnection network (55% of the peak bandwidth).

We consider three sublattice volumes, each with two structures: $4^3 \times 128$ and $8^3 \times 8$, $8^3 \times 128$ and $16^3 \times 16$, and $16^3 \times 256$ and $32^3 \times 32$. The computation to communication ratio is proportional to the volume to surface ratio. Equal-edge sublattices are favored by their larger computation to communication ratio, but would require a 4D interconnection network (16 unidirectional links of 250 MB/s sustained). Sublattices with unequal edges are what we usually obtain when embedding the four-dimensional lattice into nodes interconnected with a 3D topology. Assuming a 3D interconnection network and a $4^3 \times 128$ sublattice, Figure 13 shows that a 16 Gflops node leads to a communication time that is 1.9 times the compute time. Increasing the sublattice volume is one solution, but it substantially increases the memory requirement, as shown earlier in
Figure 3. For instance, to decrease the communication to 50% of the compute time, we need to increase the sublattice to $16^3 \times 256$ (which requires accessing 805 MB in one Hopping_Matrix call). In a massively parallel machine, we practically try to match the physical memory to the data accessed, because a virtual memory much larger than the physical memory is penalized by expensive IO accesses, especially for an application like QCD that streams data from memory most of the time, with little reuse. Most high-performance nodes, like the Cell BE and Power6, embed the memory controller on the chip and can be connected to a limited physical memory (usually in the range of 0.5 to 2 GBytes). The communication can be cut in half if we adopt a 4-dimensional interconnection network with the same link bandwidth, similar to that of QCDSP [2, 13], requiring 16 unidirectional links per computing node.

Conclusion

In this study, we presented the attributes characterizing the main kernel routine of the Lattice QCD computation. We additionally studied the optimizations and code transformations needed for Lattice QCD on a representative set of architectures, including general-purpose processors, like Itanium, and commodity floating-point accelerators, such as GPUs and the Cell BE. Most of the optimizations presented in this work target better use of memory bandwidth, friendlier cache behavior and efficient use of vector instructions, especially on accelerators. The performance ranges varied widely, but the use of accelerators showed appealing potential, especially with the Cell BE. There is also promising potential with GPU accelerators if the above-mentioned improvements are introduced. Neither should one underestimate the potential of homogeneous multi-core architectures with more cores and large caches. The prospects are open and the foreseeable evolution will be very fast.

The computation to memory access ratio of the Lattice QCD computation is lower than what is afforded by all the studied architectures, and this trend is expected to continue in the future. Architectures with explicit resource management can allow more efficient use of the resources. We show that the increased performance of computing nodes
will increase the need for a higher performance interconnection network, which was traditionally easily achievable for the Lattice QCD computation but will be the critical issue in the future. Algorithmic improvements such as domain decomposition [14], which increase the computation to remote data access ratio, will be welcome.
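
As an illustration of the back-of-the-envelope estimates used in the multi-node discussion, the following Python sketch evaluates the node count of a lattice decomposition, the simple latency model (setup time plus data size over bandwidth), and the power implied by a given Watt/Gflops ratio. It is only a sketch: the function names and the default 1 µs setup time are illustrative assumptions and are not part of the studied code, and the model ignores contention and overlap of communication with computation.

```python
from math import prod


def node_count(lattice, sublattice):
    """Number of nodes needed when every node holds exactly one sublattice."""
    if len(lattice) != len(sublattice):
        raise ValueError("lattice and sublattice must have the same dimensionality")
    if any(full % part for full, part in zip(lattice, sublattice)):
        raise ValueError("each sublattice edge must divide the lattice edge")
    return prod(full // part for full, part in zip(lattice, sublattice))


def communication_latency(data_bytes, bandwidth_bytes_per_s, setup_s=1e-6):
    """Simple model from the text: latency = setup time + data size / bandwidth."""
    return setup_s + data_bytes / bandwidth_bytes_per_s


def setup_share(data_bytes, bandwidth_bytes_per_s, setup_s=1e-6):
    """Fraction of the modelled latency that is spent in the setup time."""
    return setup_s / communication_latency(data_bytes, bandwidth_bytes_per_s, setup_s)


def power_megawatts(sustained_gflops, watt_per_gflops):
    """Power in MW for a sustained rate (in Gflops) at a given Watt/Gflops ratio."""
    return sustained_gflops * watt_per_gflops / 1e6


if __name__ == "__main__":
    # Decomposition quoted in the text: 128^3 x 256 lattice, 16^3 x 16 sublattices.
    print(node_count((128, 128, 128, 256), (16, 16, 16, 16)))
    # Hypothetical 1 MB message over a 250 MB/s sustained link.
    print(setup_share(1_000_000, 250e6))
    # One Petaflops sustained (10^6 Gflops) at 3 and 6 Watt/Gflops.
    print(power_megawatts(1e6, 3), power_megawatts(1e6, 6))
```

With the figures quoted in the text, the decomposition gives 8192 nodes, and a sustained Petaflops at 3 to 6 Watt/Gflops corresponds to 3 to 6 MW. The 1 MB message size is a hypothetical value chosen only to show that the setup time becomes negligible for large contiguous transfers.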
