A program's microarchitectural resource requirements change as it goes through different phases of execution. Microprocessors, on the other hand, are designed to provide a fixed set of resources, leading to sub-optimal power and/or performance. Multi-configuration hardware that adapts to a program's requirements has been shown to provide a much better power/performance tradeoff.
Introduction
By definition, general-purpose microprocessors are used for a variety of applications such as word processing, multimedia content creation, web servers, and game playing.
These applications have widely differing hardware resource demands, and the exact set of applications running on the computer at any given time may also vary considerably. General-purpose microprocessors are, therefore, designed to provide good performance across a spectrum of workloads. On the other hand, this means that performance and/or power consumption is often non-optimal for a specific program or workload, or, for that matter, for a specific phase of a program.
One way to optimize performance/power consumption is to dynamically tune the microarchitectural resources to match the program's requirements. Such an adaptive microarchitecture may consist of several multi-configuration functional units, each of which can be configured on-the-fly to adapt to a program's current requirements and characteristics. Researchers have proposed several such multi-configuration hardware structures including configurable caches, branch predictors, issue windows, and pipelines.
An important aspect of an adaptive microarchitecture is the algorithm that manages the multi-configuration hardware structures. The tuning algorithm must be able to detect changes in program behavior and reconfigure the microarchitecture accordingly. The algorithm must be efficient and robust, to get maximum benefit out of such a microarchitecture. Tuning an adaptive microarchitecture is complicated by the fact that the total number of possible configurations grows combinatorially with the number of multiconfiguration units deployed. Moreover, complex interactions between the units may make local tuning algorithms sub-optimal.
Previous work proposed a method for detecting phase changes by collecting instruction working set "signatures" [1]. These signatures were used as the basis of basic reconfiguration algorithms. In this paper, we consider properties of instruction working set signatures in more detail and extend signature collection to data working sets. Changes in data working set signatures are shown to be highly correlated with changes in instruction working set signatures, further validating the use of instruction working sets for detecting overall program phase changes. Sampling methods that reduce the overhead of signature collection are shown to retain phase detection accuracy. Based on working set signatures, we propose a general framework for managing several multi-configuration hardware units in an integrated fashion. Specifically, the paper describes a low-overhead profiling technique and algorithms that can manage an adaptive microarchitecture containing a combination of configurable caches and a configurable branch predictor.
The rest of this section provides a brief overview of multi-configuration hardware and the co-designed virtual machine paradigm. Section 2 reviews certain basic aspects of working sets and addresses important issues such as stability of working sets and correlation between instruction and data working set changes. Section 3 describes some design optimizations to minimize the profiling overhead associated with working set signatures.
In section 4, we present an adaptive microarchitecture and the basic tuning algorithm.
Detailed evaluation of various flavors of the tuning algorithm is presented in section 5.
Section 6 briefly summarizes related work and section 7 concludes the paper.
Multi-configuration units
With power becoming a critical consideration for general-purpose microprocessors, several microarchitectural techniques have been proposed for reducing power consumption. There have been proposals for configurable caches and TLBs [2] [3] [4], issue windows [5, 6] and pipelines [7]. Almost all of these techniques focus on adapting the shape and size of the hardware to match the program requirements. For example, caches can be configured such that they are just big enough to fit the program's working set. If the working set is small, a large part of the cache can be shut down, leading to power savings. There have also been proposals to use configurable hardware to improve performance. These include a configurable memory hierarchy [8, 9] and configurable branch predictor global history length [10].
Future microprocessors will most likely employ a combination of these techniques.
This leads to a fairly complex optimization problem, especially if the methods interact with one another. In order to manage multiple configurable units, we develop algorithms that accurately detect program phase changes and trigger tuning only when the phase changes. As we will show in section 2, program phases are closely associated with instruction working sets, and instruction working sets can be stable over several million instructions. By tuning only when working set changes are detected, we avoid many unnecessary reconfigurations. And, by using a history-based algorithm that relies on previously learned configuration information for recurring phases, it is possible to eliminate the need to go through the tuning process on every instruction working set (phase) change.
Sophisticated algorithms such as the ones we propose are more amenable to software rather than hardware implementation. A software implementation not only provides additional flexibility but also is likely to consume less power. Huang et al. [11] chose to implement their algorithm in low-level operating system routines. We use a co-designed virtual machine to implement our algorithms.
Co-designed virtual machines
A co-designed virtual machine [12, 13] consists of a layer of software designed concurrently with the hardware implementation. This software is hidden from all conventional software and would typically be developed as part of the hardware design effort.
The base technology is used in the Transmeta Crusoe [12] and the IBM Daisy/BOA projects [13] primarily to support whole-system binary translation. The IBM 390 processors use a similar technology, "millicode" [14] , to support execution of complex instructions.
In this work, we are not interested in the binary translation aspect. In fact, for managing configurable hardware, no changes need to be made to existing binaries. Referring to Fig. 1, we use the virtual machine monitor (VMM) as a sort of "micro-operating system" for managing configurable resources in the microarchitecture, to optimize power consumption for the individual subsystems and to manage them as a whole.
The VMM periodically checks for program phase changes. If it detects a phase change then it tunes the microarchitecture to the program. The VMM can maintain a table of static phases and their associated configuration information, in hidden memory. If a phase recurs, the tuning process is reduced to a table lookup. For long running programs, the phase table data is "implementation state" that can be saved in hidden memory when context switches occur [15] and restored later when the program resumes.
A detailed evaluation of the virtual machine is beyond the scope of this paper. However, section 5 does provide a rough estimate for the VMM overhead.
Program phases and working sets
Detecting program phase changes is key to implementing efficient and robust algorithms. A program phase is an intuitive concept, visually evident from time variance plots of program behavior. However, in practice phases are not easily determined, or even easily defined. Since phases are manifestations of the working set of the program [16] , we define a program phase to be equivalent to the program's instruction working set. This section addresses some of the issues related to this definition, such as stability of working sets, correlation of instruction and data working sets, etc.
Working sets
Classically, a working set $W(t_i, \tau)$, for $i = 1, 2, \ldots$, is defined as the set of distinct segments $\{s_1, s_2, \ldots, s_\omega\}$ touched over the $i$-th window of size $\tau$ [18]. The working set size is $\omega$, the cardinality of the set. The segments are typically memory regions of some fixed size. For our application, segments are the size of an instruction cache block (64 bytes), unless specified otherwise.
In order to detect working set changes, we need a measure of similarity because the same program phase may not always touch exactly the same segments in each window due to small differences in execution. We use the relative working set distance

$$\delta\big(W(t_i,\tau), W(t_j,\tau)\big) = \frac{\left|W(t_i,\tau) \cup W(t_j,\tau)\right| - \left|W(t_i,\tau) \cap W(t_j,\tau)\right|}{\left|W(t_i,\tau) \cup W(t_j,\tau)\right|} \qquad (1)$$

to compare two phases with working sets $W(t_i, \tau)$ and $W(t_j, \tau)$. Basically, this is the number of segments in which they differ, normalized by the number of segments in their union. We define a threshold $\delta_{th}$ and register a working set change only if $\delta > \delta_{th}$. We have found that a threshold of 50% works well in practice because major phase changes are quite abrupt.
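The distance measure described above can be illustrated with a minimal Python sketch. It treats a working set as a plain set of block numbers; the function names and the example block numbers are illustrative assumptions, not part of the original work.

```python
def relative_working_set_distance(ws_a, ws_b):
    """Segments not shared by both sets, normalized by the size of the union."""
    a, b = set(ws_a), set(ws_b)
    union = a | b
    if not union:
        # Assumption: two empty working sets are treated as identical.
        return 0.0
    return len(a ^ b) / len(union)


def is_phase_change(ws_a, ws_b, threshold=0.5):
    """Register a working set change only if the distance exceeds the threshold."""
    return relative_working_set_distance(ws_a, ws_b) > threshold


# Segments are identified by 64-byte block number (made-up values).
previous_window = {0x400, 0x401, 0x402, 0x403}
current_window = {0x402, 0x403, 0x404, 0x405}
distance = relative_working_set_distance(previous_window, current_window)
print(f"distance = {distance:.2f}")  # 4 differing segments out of 6 in the union
print(is_phase_change(previous_window, current_window))
```

In this example the two windows share only two of six distinct blocks, so the distance exceeds the 50% threshold and a change is registered.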
Program phase stability
Program phase behavior is caused as control passes through procedures and nested loop structures. As a consequence, each program has certain natural phase boundaries.
Ideally, we would like to detect phase changes at their natural boundaries. However, for design simplicity, we detect phase changes by comparing working sets in consecutive non-overlapping intervals (windows) of a fixed size. If the working set remains unchanged across multiple intervals, we consider it as one stable phase; if it keeps changing between consecutive intervals, we consider it as an unstable phase.
If the sampling window is a multiple of the natural phase length and is aligned with the natural phase boundary, then the program execution can be broken down into several stable intervals. This is the best-case scenario for tuning algorithms. It is however unlikely for such a perfect match to occur and thus the program executions look more like a series of stable phases separated by unstable regions.
The reconfiguration overhead in our application sets a practical lower limit on window size at about 100,000 instructions. There is no upper limit except that too big a window can lead to lost opportunities for reconfiguration. Fig. 2 shows the percentage of time spent in stable phases by the SPEC 2000 benchmarks for different window sizes.
Each benchmark was run for up to 20 billion instructions. In the figure, a phase is considered stable if the relative working set distance from the previous phase, as defined above, is less than 50%.
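As a rough illustration of how the stability figure can be computed, the following Python sketch classifies each fixed-size window as stable or unstable by comparing its working set with that of the previous window. The helper name `stable_fraction` and the toy windows are assumptions for illustration; the first window has no predecessor and is simply not counted.

```python
def stable_fraction(window_sets, threshold=0.5):
    """Fraction of windows whose working set stays within `threshold`
    of the previous window's working set."""

    def distance(a, b):
        union = a | b
        return len(a ^ b) / len(union) if union else 0.0

    comparisons = len(window_sets) - 1
    if comparisons <= 0:
        return 0.0
    stable = sum(
        1
        for prev, curr in zip(window_sets, window_sets[1:])
        if distance(prev, curr) < threshold
    )
    return stable / comparisons


# Toy trace: one working set per window (made-up block numbers).
windows = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3, 4}, {7, 8, 9}, {7, 8, 9}]
print(stable_fraction(windows))  # 3 of the 4 comparisons are stable
```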
In Fig. 2, there is no clearly optimal window size across the benchmarks because each benchmark has a different natural phase length. Within a benchmark, the stability does not monotonically increase or decrease with the window size because phases have a fractal-like, self-similar behavior where each phase is composed of several smaller phases.
For best results, the tuning algorithms should start at a small window (100K) and expand the window until the behavior is relatively stable. In this paper, we do not explore the option of dynamically changing the window size. We use a window of 1 million instructions as it works well in most cases, leading to more than 80% of time spent in stable phases.
Phase change detection
Because successive instruction working sets have a certain amount of "noise" associated with them, our phase change detection mechanism should be able to filter the noise and detect only the significant changes. More importantly, the mechanism should be able to detect changes in the instruction as well as data working set sizes, as these determine the optimal cache sizes to be used.
To measure the efficacy of our detection mechanism, we measured the relative working set size change when a phase change is detected and the relative size change when there is no phase change between successive sampling intervals (1 million instructions).
If the mechanism is robust, the ratio between the two quantities should be large. Results are in Fig. 3. As expected, the detection mechanism works very well for instruction working sets (Fig. 3a). The working set size change is typically several times larger when a phase change is detected. In fact, many of the bars (e.g., perl) have been clipped at 500% for clarity. The mechanism does not work well for art because it never stabilizes for a window size of 1 million (Fig. 2). Fig. 3b shows that the data working set size changes are also correlated to phase changes based on instruction working sets. However, data working set sizes are noisier than instruction working set sizes. This is because a given set of static instructions can touch varying amounts of data. Since the detection mechanism detects changes in instructions, some of the data size changes can go unnoticed. Our tuning algorithms do detect this type of behavior and take appropriate measures.
We also measured the correlation for branch working sets. The results were similar to those for the instruction working sets, which is expected. The figure is not included due to space constraints.
Working set signatures
In practice, working sets can be huge, so representing and manipulating true working sets is impractical for our application. Consequently, we use a lossy compressed representation of the working set called a working set signature [1]. The signature is formed by periodically sampling the working set over a window of instructions and hashing the samples into an n-bit vector. Working set change can be detected by using the relative signature distance. For two signatures $S_1$ and $S_2$, the relative signature distance is defined as

$$\Delta = \frac{\mathrm{ones}(S_1 \oplus S_2)}{\mathrm{ones}(S_1 \lor S_2)} \qquad (2)$$

i.e., (ones count of exclusive OR)/(ones count of inclusive OR). As with full working sets, we will use a threshold value $\Delta_{th}$ in order to detect phase changes.
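A minimal Python sketch of signature construction and comparison follows. A Python integer serves as the n-bit vector, and `random.Random(key).randrange(n)` is used as a stand-in for a randomizing hash; both choices, as well as the function names, are assumptions made for illustration and not the profiling hardware itself.

```python
import random

SIGNATURE_BITS = 1024   # n, the signature length in bits
BLOCK_BYTES = 64        # segment size (one instruction cache block)


def hash_block(block, n_bits=SIGNATURE_BITS):
    # Stand-in for a randomizing hash: seed a PRNG with the key, take one draw.
    return random.Random(block).randrange(n_bits)


def build_signature(addresses, n_bits=SIGNATURE_BITS):
    """Hash every sampled address's block number into an n-bit vector."""
    signature = 0
    for addr in addresses:
        signature |= 1 << hash_block(addr // BLOCK_BYTES, n_bits)
    return signature


def relative_signature_distance(s1, s2):
    """(ones count of exclusive OR) / (ones count of inclusive OR)."""
    union = s1 | s2
    if union == 0:
        return 0.0  # assumption: two empty signatures are identical
    return bin(s1 ^ s2).count("1") / bin(union).count("1")


loop_a = build_signature(range(0x1000, 0x1000 + 200 * BLOCK_BYTES, 4))
loop_b = build_signature(range(0x90000, 0x90000 + 200 * BLOCK_BYTES, 4))
print(relative_signature_distance(loop_a, loop_a))  # identical signatures
print(relative_signature_distance(loop_a, loop_b))  # unrelated code regions
```

Because the signature is lossy, two different blocks may set the same bit; the distance is therefore an approximation of the relative working set distance rather than an exact value.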
The working set size can also be estimated using the size (number of ones) of the signature. Probabilistically, if $k$ is the number of segments in the true working set, then randomly hashing them into an $n$-bit vector leads to a fill factor $f$, given by

$$f = 1 - \left(1 - \frac{1}{n}\right)^{k} \qquad (3)$$

Given the signature size, this relation can be used to estimate the true working set size. This information can be used to directly configure certain multi-configuration hardware whose performance is directly correlated to the working set size (for example, caches).
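The size estimate can be sketched in Python as follows, assuming the usual occupancy model for random hashing, in which k distinct segments hashed into n bits leave each bit clear with probability (1 - 1/n)^k. The function names are illustrative assumptions.

```python
import math


def fill_factor(k, n_bits):
    """Expected fraction of bits set after hashing k distinct segments."""
    return 1.0 - (1.0 - 1.0 / n_bits) ** k


def estimate_segments(ones, n_bits):
    """Invert the fill factor relation to estimate the working set size."""
    f = ones / n_bits
    if f >= 1.0:
        return math.inf  # fully saturated signature: size cannot be estimated
    return math.log(1.0 - f) / math.log(1.0 - 1.0 / n_bits)


# A 32 KB working set with 64-byte segments has 512 segments.
segments = 32 * 1024 // 64
f = fill_factor(segments, 1024)
print(f"fill factor: {f:.3f}")
print(f"estimated segments: {estimate_segments(f * 1024, 1024):.1f}")
```

As the fill factor approaches 1, small changes in the ones count correspond to very large changes in the estimate, which is why a saturated signature gives unreliable sizes.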
Bit-vector size
The signature bit-vector size limits the largest working set that can be represented effectively. Small bit vectors get saturated with relatively small working sets, leading to reduced phase change detection accuracy due to increased collisions. For phase change detection, we use instruction working set signatures. Fig. 4 shows the correlation (Pearson product moment correlation coefficient) between signature changes and working set changes for four different bit vector sizes. There is a distinct difference between the integer and floating point benchmarks, due to the differences in their working set sizes. Integer benchmarks, unlike floating point benchmarks, have relatively large instruction working sets, and so the correlation drops as the bit-vector size decreases, due to the increased saturation. Overall, bit-vectors of size 1Kbit and more work quite well, with correlation exceeding 95% in all cases.
Signature saturation also leads to large errors in estimation of working set size due to the asymptotic relationship between the working set size and the signature fill factor [1] .
For our application to configurable caches (section 5), we need to accurately estimate instruction and data working set size up to 32KB and 512KB respectively. For instruction working set size estimation, a 1Kbit signature works quite well, leading to a fill factor of 40% (equation 3) for a 32KB working set, well below saturation. For data working set size estimation, we build data working set signatures. These are created by sampling load and store addresses and are used only for size estimation. For a 1Kbit signature, a 512KB working set leads to a fill factor of 99.97%, which is highly saturated and hence undesirable. A 4Kbit signature works well, leading to a fill factor of 86%. In the rest of the paper, we use instruction and data signatures of size 1024 bits and 4096 bits respectively.
Hardware implementation
So far, we have used srandom() (a C library function) as our random hash function.
This function is too complex for direct hardware implementation. We found empirically that a simple folded XOR function performs equally well for our application. The hardware required is minimal as it involves breaking the key into three parts and performing an exclusive-OR on them.
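One possible form of such a folded XOR hash is sketched below in Python. The text only states that the key is broken into three parts that are XORed together; the choice of three adjacent 10-bit fields (matching a 1024-bit signature) and the function name are assumptions made for illustration.

```python
def folded_xor_hash(key, index_bits=10):
    """Fold a key into an index by XORing three adjacent bit fields.

    Assumption: the key is a block address and only its low
    3 * index_bits bits take part in the hash.
    """
    mask = (1 << index_bits) - 1
    low = key & mask
    middle = (key >> index_bits) & mask
    high = (key >> (2 * index_bits)) & mask
    return low ^ middle ^ high


# Program counter divided by the 64-byte block size gives the key.
pc = 0x120004A40
index = folded_xor_hash(pc // 64)
print(index)  # bit position to set in a 1024-bit signature
```

In hardware this amounts to one level of three-input XOR gates per index bit, which is consistent with the minimal cost noted above.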
Hardware implementation also puts a constraint on the number of keys that can be sampled. For example, if the machine retires 4 instructions per cycle, it may not be feasible to sample all four PCs every cycle. Moreover, to minimize the power dissipated by the profiling hardware, it may be desirable to profile 1 in 8 or 1 in 16 keys. As the sampling rate is reduced, the correlations fall off. However, a sampling rate as low as 1 in 8 instructions works well, showing about 95% correlation on average. Fig. 6 shows the average correlations between signature size and working set size for both instruction and data working sets. We correlated the theoretical fill factor and the experimentally observed fill factor, and averaged the result over all the benchmarks. The graph shows an interesting phenomenon. Unlike the correlation for instruction working sets, the correlation for data working sets falls rapidly. This is because instructions are reused more often than data and hence the probability of missing an event due to periodic sampling is reduced. A sampling rate of 1 in 2 data references is evidently sufficient (Fig. 6). Because only about 1/3 of instructions are data references, the sampling rate per instruction is about the same as the instruction sampling rate.
Adaptive microarchitecture
An adaptive microarchitecture is composed of several multi-configuration units, profiling hardware and a tuning algorithm. The next three sub-sections describe each of these in some detail.
Multi-configuration hardware
The adaptive microarchitecture we studied is based on the Alpha 21264 microarchitecture. The microarchitecture has four multi-configuration units: the L1 instruction and data caches, the unified L2 cache and the branch predictor (Fig. 7). Each of the units can be configured into four different sizes. The L1 instruction and data caches are 2-way set associative, with a maximum size of 64KB. The L2 cache is 4-way set associative, with a maximum size of 1MB. The branch predictor is a gshare predictor with a maximum size of 8K entries. The global history length is fixed at 10. Table 1 contains specific details about each of these units.
The caches are reconfigured by changing the number of active sets. This can be implemented via sleep transistors as proposed in [2] . Changing the number of sets in the cache does require resizing of the tags. Thus, the number of tag bits maintained in the cache is equal to the number of bits required for the smallest cache. This adds a very small power overhead to the cache. The predictor is reconfigured by power gating sections of the branch history table.
Reconfiguration leads to some loss in performance due to increased misses and mispredictions as a result of lost implementation state. Reconfiguration of the L1 data cache and the L2 cache is also associated with a performance overhead of writing back dirty lines. Dirty lines have to be written back to the lower level of the hierarchy if the mappings change or sets are deactivated. This leads to a considerable reconfiguration overhead, especially for the large L2 cache -over 100,000 cycles for the 1MB configuration.
We reduce some of the overhead by performing writebacks only on lines that map to a different set in the new configuration. A further reduction in overhead can be achieved by changing the associativity of the cache rather than the number of sets. However, this approach can lead to very highly associative caches in the maximum configuration. We did not explore this option because our tuning algorithm amortizes the cost of reconfiguration over long phases, so writeback of dirty lines is not prohibitive.
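The selective writeback described above can be sketched in Python. The sketch assumes that the set index is taken from the low-order bits of the block address, so that resizing a cache changes only the number of index bits; the function names are illustrative assumptions.

```python
def set_index(block_addr, num_sets):
    """Set index under a given configuration (num_sets is a power of two)."""
    return block_addr % num_sets


def lines_to_write_back(dirty_blocks, old_sets, new_sets):
    """Dirty blocks whose set index differs between the two configurations."""
    return [
        block
        for block in dirty_blocks
        if set_index(block, old_sets) != set_index(block, new_sets)
    ]


# Shrinking from 8 sets to 4: only blocks that lived in sets 4..7 must move.
print(lines_to_write_back(range(8), old_sets=8, new_sets=4))

# Growing from 4 sets to 8: blocks whose extra index bit is 1 must move.
print(lines_to_write_back([0, 4, 8, 12], old_sets=4, new_sets=8))
```

Under this assumption, roughly half of the dirty lines keep their set index when the number of sets is halved or doubled, so the writeback traffic is reduced accordingly.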
Profiling hardware
The profiling hardware provides the VMM with working set information and number of misses generated by the various units (Fig. 7) . The instruction and data working set information is collected over a window of 1 million instructions using 1Kbit and 4Kbit signatures respectively. The hardware samples 1 instruction and 1 data reference every 2 clock cycles. Because the machine is 4-wide, this leads to an average sampling rate of 1 in 8 instructions and 1 in 2 data references, which was shown to be adequate in the previous section.
The profiling hardware also collects the number of cycles executed over the sampling window, the number of misses generated by the instruction cache, data cache and unified L2 cache, and the number of mispredicts generated by the branch predictor. Almost all of this information is present in performance counters on modern microprocessors.
In our model, we assume that certain "special" instructions are provided to the VMM to read/write profiling hardware and control registers. However, these instructions are hidden from conventional software, and hence no ISA changes are required.
Tuning algorithms
Tuning is the process of finding the minimal configuration amongst a set of configurations such that the performance (IPC) degradation relative to the best achievable is within a (small) specified range. The complexity of tuning several units lies in the fact that the number of possible configurations increases combinatorially with the number of multi-configuration units. It can be argued that certain units such as caches can be tuned by examining their miss rates. However, this may not be possible because the exact contribution of the miss rates towards the performance varies with time.
The algorithms we propose are based on accurate detection of program phase changes and dynamic allocation of misses to each unit. A phase is considered stable if the relative signature distance between successive intervals is below the threshold; otherwise it is considered unstable. The configurations are set to the maximum during an unstable phase. The algorithm starts in the UNSTABLE state.
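Once a stable phase is detected, the algorithm transitions to the AVERAGE state, wherein the number of cycles, misses and mispredicts per 1 million instructions are computed by averaging over a fixed number of intervals. Given the number of cycles per interval (CPI), the number of extra cycles tolerable ($\Delta CPI$) is computed from the specified IPC tolerance ($\Delta IPC$) as follows. Because the number of instructions per interval is fixed, a relative IPC loss of $\Delta IPC$ corresponds to $CPI / (1 - \Delta IPC)$ cycles per interval, so

$$\Delta CPI = CPI \cdot \frac{\Delta IPC}{1 - \Delta IPC} \qquad (4)$$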
The extra cycles are divided equally amongst the $N$ units (four in our case) and the number of tolerable misses per unit ($\Delta miss_i$) is computed from the miss penalty ($P_i$) as follows.

$$\Delta miss_i = \frac{\Delta CPI}{N \cdot P_i} \qquad (5)$$

Once the allocated misses for each unit are computed, the tuning process is decoupled. Each unit is independently tuned under the constraint that its misses stay under the allocated number (Fig. 8b). Tuning for a unit stops as soon as the optimal configuration for that unit is found. The algorithm transitions to the UNSTABLE state whenever an unstable phase is detected (i.e., the relative signature distance is greater than 50%); the configurations are set to maximum and the tuning process starts all over again.
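The miss allocation step can be illustrated with a short Python sketch. It assumes that the IPC tolerance is a relative loss, so that the tolerable extra cycles are CPI x tolerance / (1 - tolerance), and that the extra cycles are split equally among the units. The miss penalties below are hypothetical placeholder values, not the simulated machine's parameters.

```python
def tolerable_extra_cycles(cycles_per_interval, ipc_tolerance):
    """Extra cycles per interval that keep the relative IPC loss within tolerance."""
    return cycles_per_interval * ipc_tolerance / (1.0 - ipc_tolerance)


def allocate_misses(cycles_per_interval, ipc_tolerance, miss_penalties):
    """Split the extra cycles equally and convert each share into a miss budget."""
    extra = tolerable_extra_cycles(cycles_per_interval, ipc_tolerance)
    share = extra / len(miss_penalties)
    return {unit: share / penalty for unit, penalty in miss_penalties.items()}


# Hypothetical miss penalties in cycles, for illustration only.
penalties = {"icache": 10, "dcache": 10, "l2": 100, "predictor": 7}
budget = allocate_misses(1_000_000, 0.02, penalties)
for unit, misses in budget.items():
    print(f"{unit}: {misses:.0f} extra misses per interval")
```

Once each unit has its own budget, it can be tuned without reference to the others, which is what keeps the search from growing combinatorially with the number of units.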
It should be emphasized that decoupling the tuning process is based on the fact that the number of misses and thus cycles remain stable over a phase. This condition is sometimes difficult to satisfy due to the inherent noise associated with data working sets. In such situations, appropriate "back-off" measures need to be taken. The next section addresses this issue in more detail.
Evaluation
We used a modified version of the Alphasim 1.0 [18] simulator, which provides a detailed model of the 21264 and its memory system. We compared the performance of tuning algorithms to a baseline configuration with 64KB L1 instruction and data caches, 1MB unified L2 cache and an 8K entry gshare branch predictor. We ran simulations to ensure that the baseline was not over designed, especially the 1MB L2 cache. We found that many SPEC 2000 benchmarks such as gcc, bzip2 and swim did benefit significantly from the large L2 cache. The microprocessor core parameters are similar to the 21264 [19] . Table 1 shows some of the important parameters of the microarchitecture. and gap have a mix of short and long noisy phases. In order to limit simulation time, benchmarks were run for at most 5 billion instructions. For most benchmarks, simulating 2 billion instructions was sufficient to capture most phases. However, benchmarks mcf, parser, wupwise and gcc were run for 3, 4, 4 and 5 billion instructions respectively to capture their major phases.
We did not simulate virtual machine monitor instructions. On every call, the VMM computes the relative signature distance, which involves computing one 128 byte XOR, one 128 byte OR and two 128 byte population counts -each of which is a highly parallel operation. The tuning algorithm involves fewer than 100 arithmetic operations. Overall, even if we assume that every call to the VMM incurs an overhead of 1000 instructions executed, the net performance overhead is limited to less than 0.1% on average (assuming the VMM runs every 1 million instructions).
The following subsections describe the various tuning algorithms explored. The algorithms use an IPC tolerance of 2%, i.e., the microarchitecture is configured such that the IPC degradation is limited to at most 2%. Each of the algorithms described stays in the AVERAGE state (Fig. 8a) for 2 intervals before allocating misses. Fig. 9 shows the performance achieved by the tuning algorithms and Fig. 10 shows the average unit sizes achieved by the algorithms, normalized to the baseline configuration.
Simple tuning
The simple tuning algorithm is essentially the algorithm described in section 4.3 (Fig. 8). The algorithm works quite well for floating point benchmarks, but suffers significant performance degradation for most integer benchmarks (gcc, bzip2, gzip, mcf, parser). On average, it achieves 45% reduction in instruction cache size, 28% reduction in data cache size, 28% reduction in L2 cache size and 33% reduction in branch predictor size compared to the baseline configuration. The performance loss relative to the baseline is 3.44%.
The performance loss seen in integer benchmarks can be attributed to one or both of the following reasons.
1. While tuning, the decision to increase or decrease the size of a unit is based on observing the number of misses over a million instructions. Sometimes, trying out a smaller configuration causes the misses to exceed the allocated number by several orders of magnitude. In benchmarks with relatively short phases (e.g. gzip and bzip2), the extra misses cannot be amortized over the phase length, leading to a significant performance loss.
2. The tuning algorithm assumes that the instruction and data working sets remain stable over the entire phase. However, as shown in sections 2 and 3, this may not always be true -especially in the case of data working sets. Once tuning is completed, any extra misses caused due to working set changes go unnoticed -leading to performance loss.
Tuning with back-off
The performance loss in the previous algorithm was partly due to the absence of an eager recovery policy and partly due to complete reliance on working set signatures to detect changes in program behavior. In order to fix this, we extend the simple tuning algorithm with the following back-off mechanisms. It is evident from Figs. 9 and 10 that the back-off mechanisms work. The average unit sizes increase slightly, but the performance losses are reduced significantly. On average, the algorithm achieves 45% reduction in instruction cache size, 20% reduction in data cache size, 26% reduction in L2 cache size and 29% reduction in branch predictor size with a relative performance loss of 0.88%, well below the specified tolerance.
Signature size based tuning
It has been shown [1] that certain units, whose performance depends on the working set size, can be directly configured by estimating the working set size from the signature size (fill factor). We found that this reconfiguration technique works well for the instruction cache and L2 cache, but less well for the data cache. In the case of data caches, the signature-size based algorithm overestimates the required size. This is because data cache misses can be partially hidden by out-of-order issue and hence cache sizes smaller than the working set work well in practice.
The signature-size based algorithm estimates the instruction and data working set size from the respective working set signatures (equation 3). Since instruction cache misses are relatively expensive, the instruction cache is conservatively configured such that the working set fits in 90% of the cache. The L2 cache size is estimated by adding up the instruction and data working set size. The predictor is tuned using the trial and error algorithm in Fig. 8 . The algorithm also implements the back-off mechanisms described in the previous section.
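The sizing rule in this paragraph can be sketched in Python as follows. The working set size is estimated from the signature's ones count, assuming the random-hashing occupancy model f = 1 - (1 - 1/n)^k with 64-byte segments, and the smallest configuration that satisfies the rule is selected. The lists of available cache sizes are hypothetical, since only the maximum sizes and the number of configurations are given in the text. The data cache is left out because its sizing rule is not spelled out.

```python
import math

# Hypothetical configuration sizes in bytes (four per unit, largest from the text).
L1_SIZES = [8 * 1024, 16 * 1024, 32 * 1024, 64 * 1024]
L2_SIZES = [128 * 1024, 256 * 1024, 512 * 1024, 1024 * 1024]


def estimate_bytes(ones, n_bits, segment_bytes=64):
    """Working set size in bytes estimated from a signature's ones count."""
    f = ones / n_bits
    if f >= 1.0:
        return math.inf  # saturated signature
    return segment_bytes * math.log(1.0 - f) / math.log(1.0 - 1.0 / n_bits)


def pick_size(working_set_bytes, sizes, usable_fraction=1.0):
    """Smallest size whose usable fraction holds the working set, else the largest."""
    for size in sorted(sizes):
        if working_set_bytes <= usable_fraction * size:
            return size
    return max(sizes)


def configure(instr_ones, data_ones):
    i_bytes = estimate_bytes(instr_ones, 1024)
    d_bytes = estimate_bytes(data_ones, 4096)
    return {
        "icache": pick_size(i_bytes, L1_SIZES, usable_fraction=0.9),
        "l2": pick_size(i_bytes + d_bytes, L2_SIZES),
    }


print(configure(instr_ones=200, data_ones=1500))
```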
The algorithm achieves 48% reduction in instruction cache size, 2.5% reduction in data cache size, 23% reduction in L2 cache size, and 29% reduction in branch predictor size (Fig. 10) with a relative performance loss of only 0.45% (Fig. 9). As expected, savings in data cache size are negligible. It is noteworthy that the algorithm achieves a much smaller instruction cache size for twolf. This is because the trial and error algorithm is more susceptible to noise associated with phases and happens to lock in a larger configuration for a particular (long) phase. The signature-size algorithm is based on a statistical profile and is thus immune to minor fluctuations in the working set size. It is thus able to pick a configuration that provides good average performance for a given phase.
Phase table based tuning
The phase table based algorithm records the configuration found for each stable phase and reuses it when the phase recurs, reducing the tuning process to a table lookup. In the presence of noise, two different static phases can sometimes alias to the same entry, leading to a non-optimal configuration. If the configuration is smaller than optimal, the back-off mechanisms come into play and appropriately increment the configuration. However, if the configuration is larger than optimal, it stays there, leading to larger unit sizes on average. To avoid this, the phase table is updated with the smallest configuration found for the particular phase.
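A minimal Python sketch of a phase table with this update policy is given below. It assumes that entries are matched by relative signature distance against the same 50% threshold used for phase detection and that "smallest configuration" is applied per unit; the class, its methods and the example sizes are illustrative assumptions rather than the evaluated implementation.

```python
class PhaseTable:
    """Maps phase signatures to the smallest configuration seen for that phase."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.entries = []  # each entry: {"signature": int, "config": dict}

    @staticmethod
    def _distance(s1, s2):
        union = s1 | s2
        if union == 0:
            return 0.0
        return bin(s1 ^ s2).count("1") / bin(union).count("1")

    def lookup(self, signature):
        """Return the stored configuration of a matching phase, or None."""
        for entry in self.entries:
            if self._distance(entry["signature"], signature) <= self.threshold:
                return entry["config"]
        return None

    def record(self, signature, config):
        """Insert a new phase, or keep the per-unit minimum for a known one."""
        stored = self.lookup(signature)
        if stored is None:
            self.entries.append({"signature": signature, "config": dict(config)})
            return
        for unit, size in config.items():
            stored[unit] = min(stored.get(unit, size), size)


table = PhaseTable()
table.record(0b1111, {"icache_kb": 32, "l2_kb": 512})
table.record(0b1110, {"icache_kb": 16, "l2_kb": 1024})  # same phase, noisy signature
print(table.lookup(0b1111))
print(table.lookup(0b11110000))
```

A linear scan is used here for clarity; a recurring phase costs one distance computation per stored entry, which stays small as long as a program has few distinct phases.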
On average, the algorithm achieves 44% reduction in instruction cache size, 22% reduction in data cache size, 25% reduction in L2 cache size and 26% reduction in predictor size for a performance degradation of 1.25%. The phase table based algorithm does not perform better than the simple algorithm with back-off, because the latter is able to amortize most of the tuning overhead. However, it does show that the concept of reuse holds potential and could be used to reduce performance loss for algorithms with higher tuning overhead.
Summary
To summarize, the tuning algorithm with back-off provided the best performance-size tradeoff amongst all the algorithms. We compared the performance of this algorithm to a statically configured microarchitecture with similar, but slightly better/larger resources: 48KB 3-way set associative instruction and data caches, a 768KB 3-way set associative L2 cache and an 8K entry gshare predictor.
The statically configured microarchitecture showed 2.4% performance loss relative to the baseline, as compared to 0.88% loss in the case of the dynamic algorithm. More importantly, some benchmarks such as gcc and twolf showed performance losses as high as 8% and 11% respectively. These results demonstrate the ability of adaptive microarchitectures to overcome the shortcomings of static configuration.
Related work
Huang et al. [11] described a framework and algorithms that are intended to deal with processors containing several configurable units. The algorithms described in this paper differ in two respects. First, their algorithm targets coarse-grained tuning such as voltage reduction, and tolerates performance degradation as high as 10%. Our algorithms use a much tighter tolerance (2%) and target fine-grained tuning of individual units. Second, their algorithm attempts tuning at a fixed time period, while our algorithm triggers tuning only when program phase changes are detected.
Dhodapkar and Smith [1] showed that signatures can be used to efficiently represent program working sets and to detect program phase changes. They also showed that instruction cache reconfiguration algorithms based on detection of program phase changes performed quite well compared to existing algorithms. However, the signatures used in [1] were based on ideal randomizing hash functions and the evaluation was based on miss-rates of a single unit rather than complete timing simulation.
Sherwood et al. [20] propose the use of a basic block vector (BBV) to capture the behavior of a part of a program. A BBV is similar to a working set signature except that each element of the vector represents the frequency of execution of a particular basic block in the program. As an application, the authors demonstrate the use of BBVs to speed up simulation by finding a few representative simulation points.
Conclusions
Tuning adaptive microarchitectures with several multi-configuration units is complicated due to the large number of possible configurations. In this paper, we described a profiling architecture and unified tuning algorithms for managing an adaptive microarchitecture with multi-configuration caches and a branch predictor. The profiling architecture generates instruction and data working set signatures, which form the basis for the tuning algorithms. We show that signatures generated by sampling as few as one instruction and one data reference in every eight instructions contain sufficient information to accurately detect program phase changes and estimate working set sizes.
The proposed tuning algorithms are based on accurate detection of program phase changes and dynamic allocation of misses/mispredicts to individual units. The algorithms decouple the tuning processes of individual units and significantly simplify the problem of managing several multi-configuration units. We proposed three different tuning algorithms -a basic trial and error based algorithm, an algorithm that uses signature-size to configure caches and a phase history based algorithm that uses previously found configurations for recurring phases, thereby reducing the tuning process to a table lookup.
Each of these algorithms was shown to work quite well for our application. The best performing algorithm achieved 45% reduction in instruction cache size, 20% reduction in data cache size, 26% reduction in L2 cache size and 29% reduction in branch predictor size with a relative performance loss of just 0.88%. The dynamic algorithm was also shown to outperform a statically configured microarchitecture with similar resources.
As part of future work, we plan to develop algorithms that dynamically adapt to the natural phase boundaries of the program and manage several other units besides the caches and predictor. We also plan to implement the VM software to ascertain that the associated overheads are indeed low.
