Abstract-We present a novel hardware mechanism for dynamic program phase detection in distributed shared-memory (DSM) multiprocessors. We show that successful uniprocessor mechanisms do not necessarily work well in DSM systems, since they lack the ability to incorporate the parallel application's global execution information and memory access behavior based on data distribution. We then propose a hardware extension to a well-known uniprocessor mechanism that significantly improves phase detection in the context of DSM multiprocessors. The resulting mechanism is modest in size and complexity, and is transparent to the parallel application.

Fig. 1. Example of BBV phase detector.
I. INTRODUCTION

Analyzing the time-varying behavior of applications has been the subject of several studies [5], [7], demonstrating that relying on average whole-program statistics can lead to misconceptions about a program's actual behavior, and result in poor architecture optimization decisions. Yet in spite of such behavior changes over a program's entire execution, application behavior is typically repetitive, and can be classified into distinct phases: collections of dynamic execution regions, not necessarily consecutive, exhibiting similar behavior and thus requiring similar resources. In phase-adaptive systems, hardware phase detectors monitor runtime metrics at the granularity of fixed sampling intervals, and classify these intervals into program phases [4], [8] to guide hardware reconfiguration.

Hardware phase detection has been studied extensively for uniprocessors. To our knowledge, however, no published work yet discusses general solutions to hardware phase detection in parallel systems. In this paper, we ad- [...]

• Introduce a new tool, the CoV curve, that helps quantify the quality of phase detection of a particular mechanism across multiple operating points.

II. OVERVIEW

In phase-adaptive systems, a phase detector collects program statistics at runtime and, at regular sampling intervals, determines whether the program incurred a phase change. This information is passed to a phase predictor, which infers the phase for the next sampling interval. Finally, a reconfiguration module tunes the system based on this prediction, by trying different hardware configurations at different intervals that belong to the same phase. Once tuning is complete, the best configuration is selected, and subsequently applied whenever that phase is predicted. This trial-and-error process may hurt performance, and thus it must be conducted efficiently.

Our baseline uniprocessor phase detector is Sherwood et al.'s BBV mechanism [8] (Fig. 1) [...]

[...] CPI values in that phase. The identifier CoV is then defined as the average of all per-phase CoVs, weighted by how many intervals belong to each phase. A phase detector that classifies intervals into perfectly homogeneous phases yields a CoV of zero, and CoV increases as phases deviate from this ideal interval homogeneity.

CoV is naturally smaller with a higher number of phases, since fewer intervals belong to each phase. In the extreme case, every sampling interval would constitute a distinct phase (each requiring tuning), with CoV trivially zero. Conversely, all sampling intervals could be placed in the same phase, and thus tuning overhead would be minimal but most likely futile, as the resulting CoV would be prohibitively large. To quantify this trade-off, in the next section we introduce the CoV curve, which plots CoV against a measure of tuning overhead (the fraction of intervals that are spent in tuning).

III. PHASE DETECTION IN DSM

We conduct detailed simulations of a DSM multiprocessor with up to 32 nodes. Table I shows several architectural parameters of the system we model. We use two applications from the SPLASH-2 suite (LU and FMM) [10] and two applications from the SPEC-OMP suite (Art and Equake) [9]. Table II lists the applications and input sets.

A. Uniprocessor Scheme on DSM

We evaluate the efficacy of the BBV uniprocessor scheme in a DSM environment by adding a BBV phase detector to each processor in our simulation framework (Section III). Each detector consists of a 32-entry accumulator and a 32-vector footprint [...]

[...] the overall system-wide CoV curve. Figure 2 shows the results.

As expected, when the BBV phase detector is applied to a DSM context, it classifies intervals poorly as the number of processors grows. For instance, with two processors, LU achieves CoV values under 10% with as few as seven phases. At eight and 32 nodes, however, the CoV value for seven phases rises dramatically to about 40% and 70%, respectively. In fact, when running on eight and 32 processors, LU only achieves a 20% CoV with 25 phases, a two-fold degradation with nearly four times as many phases as the two-processor case explained above.

In summary, even as the BBV mechanism has been shown to successfully characterize the behavior of sequential applications by simply tracking the distribution of executed basic blocks [8], several factors limit its effectiveness in a DSM environment running parallel codes. First, the behavior of a thread may be affected by the other threads executing in the system by means of data sharing patterns, memory traffic, network congestion, etc. Second, data distribution (e.g., local vs. remote accesses) may affect the behavior of the code executing on a node, even when a processor executes precisely the same code without interaction with others. Unfortunately, the BBV alone cannot (and was never meant to) capture these [...]

[...] are defined independently by each processor, based on the number of (locally) committed instructions. Hence, to convey information that is consistent with the requestor's interval boundaries, each processor keeps separate frequency counts of its accesses on behalf of each processor in the system. Such counts are zeroed every time the corresponding processor queries their content (see below).

[...] the different classification criteria. The BBV+DDV configuration improves the CoV values achieved by the BBV baseline across the board. Moreover, the benefits of using the DDV over the BBV baseline increase with the number of processors, as expected, since an increased node count implies (a) more and longer accesses to remote data, and (b) higher variability due to interactions among threads, which are captured by the DDV.

In FMM, for example, with 32 processors, BBV achieves a 29% CoV using 25 phases. At the same number of phases, the BBV+DDV detector reduces CoV to about 15%. The savings on the number of unique phases (and hence tunings) required to achieve a given CoV value are also dramatic: for instance, at a CoV value of 29%, the addition of the DDV reduces the number of phases from 25 to 11. These same trends repeat for 8 processors, where the addition of the DDV consistently improves the CoV [...]

VI. CONCLUSIONS

We have explored dynamic phase detection in DSM multiprocessors. We have proposed and evaluated a novel hardware extension that builds on the BBV phase detector mechanism originally developed for uniprocessors [8], but yields significant improvements in both the quality and the number of the identified phases over the BBV hardware alone. It does so by tracking the access frequency, contention, and distance to data touched by each processor.

We believe that future work in this direction should move toward combining the insights derived from our study with appropriate phase prediction mechanisms, to ultimately steer hardware reconfiguration of DSM multiprocessors.
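
As a supplementary illustration of the identifier CoV metric described in Section II, the following minimal Python sketch computes the interval-weighted average of per-phase CoVs from a trace of per-interval CPI values and phase IDs. The function name `identifier_cov`, the sample CPI trace, and the phase labels are hypothetical. The sketch also assumes that the per-phase CoV is the population standard deviation of CPI divided by its mean, which the surviving text does not spell out.

```python
from collections import defaultdict
from statistics import mean, pstdev


def identifier_cov(cpis, phase_ids):
    """Average of per-phase CoVs, weighted by the intervals in each phase."""
    if len(cpis) != len(phase_ids):
        raise ValueError("need exactly one phase ID per sampling interval")

    # Group the per-interval CPI samples by the phase they were assigned to.
    by_phase = defaultdict(list)
    for cpi, phase in zip(cpis, phase_ids):
        by_phase[phase].append(cpi)

    total = len(cpis)
    weighted = 0.0
    for values in by_phase.values():
        mu = mean(values)
        # Assumption: per-phase CoV = population std dev / mean of CPI.
        cov = pstdev(values) / mu if mu else 0.0
        weighted += cov * len(values) / total
    return weighted


# Hypothetical CPI trace with two alternating behaviors.
cpis = [1.0, 1.0, 3.0, 3.0, 1.0, 3.0]

# Perfectly homogeneous phases.
perfect = identifier_cov(cpis, ["A", "A", "B", "B", "A", "B"])
# Every interval placed in the same phase.
single = identifier_cov(cpis, ["A"] * 6)
# Every interval treated as its own phase.
per_interval = identifier_cov(cpis, list(range(6)))
```

Under these assumptions, `perfect` and `per_interval` both come out as zero, while `single` is 0.5. This mirrors the trade-off discussed in Section II: one phase per interval trivially drives CoV to zero at maximal tuning cost, whereas a single phase minimizes tuning but leaves CoV large. Repeating the computation for classifications with different numbers of phases would give the points of a CoV curve.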
