ABSTRACT In this paper, we focus on a very long instruction word (VLIW) processor design that ''shares'' its cache blocks when switching between performance modes in order to alleviate cold starts. Switching modes triggers cache resizing operations, and improper use can lead to less efficient cache performance. We note that our investigation pertains to the local temporal effects of cache resizing and to how we counteract the negative impact of cache misses in such resizing instances. We propose a novel reconfigurable d-cache framework that can dynamically adapt its least recently used (LRU) replacement policy without much hardware overhead. We demonstrate that our adaptive d-cache ensures smooth cache performance when moving from one cache size to another. This approach is orthogonal to future research in cache resizing for such architectures that takes into account the energy consumption and performance of the overall application. Our results show that, compared with a straightforward cache resizing approach, we achieve a reduction of 10%-63% in cache misses during the switching period. Furthermore, our approach increases the average cache hit rate from 56% to 90% in the part of the cache that remains live after downsizing.
I. INTRODUCTION
In modern-day many-core architectures, heterogeneous cores are combined to target a multitude of applications and even different performance levels, as, e.g., big.LITTLE does. This approach performs well when the matching of applications to cores through task scheduling occurs infrequently. In these architectures, dynamic and frequent task switching between the cores degrades performance considerably, as instructions and data need to move to the caches of their respective cores, resulting in cold starts.
Recently, another many-core approach was proposed that allows for the dynamic separation and merging of cores through parameterization of the datapaths of a very long instruction word (VLIW) processor organization. This design has been prototyped on an FPGA (not in simulation) and entails run-time adaptation towards ILP (instruction-level parallelism) or TLP (task-level parallelism) of a single or multiple application(s), respectively. This design is unique in that it can be seen as multiple small cores or a single large core using the same set of datapaths within the same design [1]. For example, low-ILP applications execute on a small number of those datapaths while the remaining datapaths can be used to form more cores. The mentioned issue-width adaptation in turn requires run-time adaptation of the data caches (d-cache) in order to match the core configuration, i.e., a mismatch will lead to unnecessary performance loss caused by increased cache misses. The cache adaptation entails upsizing/downsizing the d-caches when an external trigger causes a core to use more or fewer datapaths, respectively. We note that upsizing does not cause any cold start effects, as all live data remains accessible in the upsized cache. However, downsizing can in the worst case lead to a cold start when all live data was resident in the 'deleted' cache block during downsizing.
The associate editor coordinating the review of this manuscript and approving it for publication was Nitin Nitin.
A. BACKGROUND
One such design is the ρ-VEX processor [2] , [3] , which is a reconfigurable and extensible softcore VLIW processor.
It differs from traditional VLIW processors in that the issue-width can be set to 2, 4, or 8; the core contains a maximum of 8 datapaths. The switching between the issue-width modes is performed at run-time and can be achieved in 5 clock cycles. Combined with generic binaries [4], there is no need to maintain different versions of the application code. This allows for interrupts and resumption of code execution without the need for checkpointing in the code, even after switching modes. A key motivation of the ρ-VEX processor design is to utilize only the necessary resources when needed [5]. For example, the 2-issue mode is used for programs with low instruction-level parallelism (ILP), while the 8-issue mode is used for high-ILP programs. This notion can also be applied to execution phases within programs, e.g., control-oriented code sections (with low ILP) versus data-processing code sections (usually having high ILP). Finally, unused datapaths can be combined to form additional 2-issue or 4-issue cores in order to run multiple programs in parallel. In this manner, a gradual trade-off between ILP and thread-level parallelism (TLP) can be achieved without the need to move instructions and data around from one core to another. Such movement would certainly cause delays, as the instruction/data caches need to be filled again with live instructions/data, respectively. This is representative of current-day many-core architectures. The proposed adaptive data cache (d-cache) (introduced in the next paragraph) resides within the processor.
B. MOTIVATION
The dynamic nature of the ρ-VEX processor requires an adaptive cache organization that can combine several caches into a larger one, or separate a larger cache into smaller ones, as depicted in Figure 1; this is commonly referred to as cache resizing. More precisely, the combination or separation of caches follows the issue-width mode changes of the ρ-VEX core. Any other design choice would result in unbalanced cache sizes for the different core issue-widths. Our d-cache is a multi-way set-associative cache in which the resizing is performed by adapting the number of ways of the cache. We use a multi-way set-associative cache because it utilizes the cache resources much better than a reconfigurable direct-mapped cache and because the logic that controls combining or separating the cache blocks associated with the different cache ways is simpler. This means that the 8-issue, 4-issue, and 2-issue cores have access to a four-way, two-way, and one-way set-associative cache, respectively.
With these considerations, we investigated the effects of cache resizing triggered by the issue-width mode changes (caused by external factors) of the ρ-VEX core in different scenarios [6]. For our experiments, we chose the MiBench embedded benchmark suite [7], as the ρ-VEX core is intended to be utilized in low-power embedded systems. The core bases its decision for a mode change solely on how effectively the datapaths are being utilized when executing programs. A coordinated effort that uses information from the caches to drive the processor mode changes is a subject for future work. In the case of upsizing (combining multiple ways to form a larger cache), no detrimental effects are expected, as no live data is removed and the additional cache resources will gradually decrease the cache miss rates. However, in the case of downsizing (disabling one or more ways), we immediately observe increases in the cache miss rates. The main reason for the increased miss rates is the elimination of live data when disabling certain ways in the downsizing process.
We reiterate here that the resizing decision is made by an external source, i.e., either by the processor or by the operating system. Our proposal depends on sufficient time being available to ''prepare'' the caches before the actual reconfiguration of the processor. We propose a dynamic mechanism in the d-cache that reduces the side effects of cache downsizing. We exploit this transition time to bolster the amount of live data in the (cache) ways that remain active by giving them a higher preference when loading new data (when a miss occurs) and by moving data to these ways from the to-be-disabled ways (when a hit occurs).
To the best of our knowledge, this paper is the first to propose a dynamic adaptation of the reconfigurable cache with respect to the reconfigurable processor. We consider scenarios in which mode changes occur frequently (due to external factors) and attempt to limit the impact of each of these mode changes on the d-cache performance. Therefore, our approach is orthogonal to other reconfigurable cache approaches.
C. CONTRIBUTIONS
We target a dynamically adaptive cache that follows the mode changes of the processor. Our proposal focuses on the sharp increase in cache misses during a cache resizing event, with the aim of reducing the impact of such an event on system performance.
The experimental results show that we can reduce the number of cache misses by between 10% and 63% compared to immediate cache downsizing without taking any measures. The contributions of our paper are as follows:
• To the best of our knowledge, our proposal is the first to use the mode change of the processor as an external trigger to reconfigure the d-cache, rather than monitoring cache misses and resizing the d-cache in reaction to varying demand for cache size both within and across applications.
VOLUME 7, 2019
• We propose the concept of a transition period prior to the actual cache resizing. During the transition period, the cache mechanism migrates the active data and ensures smooth cache performance from one cache size to the other. In particular, we demonstrate that our main results can be achieved with a transition period of just 2000 to 4000 clock cycles.
• We implement a novel, simple, yet effective cache replacement policy, which migrates the accessed data to the cache ways that remain enabled (active) after resizing. This reduces the cache misses around the transition and thereby smooths and improves the performance of the application during the transition period and right after the downsizing event.
• Our approach allows for immediate cache downsizing without the need to maintain ''live'' portions of the cache after the downsizing event.
The remainder of the paper is organized as follows. In Section II, we motivate when and how to exploit cache resizing. In Section III, the framework of the reconfiguration and the extension replacement policy are presented. In Section IV, we describe the evaluation environment. In Section V, the experimental results are discussed in detail. In Section VI, we present the related work. In Section VII, we draw our conclusions. Finally, our future work is presented in Section VIII.
II. KEY ISSUE OF OUR APPROACH
Because the issue-width mode of the ρ-VEX core can be changed dynamically, we exploit the cache resizing opportunities that arise while executing applications. Our approach to resizing caches therefore differs primarily in two aspects: (1) when to resize, and (2) how to resize. In this section, we discuss both aspects and propose our solution. The details of the resizing framework are described in Section III.
A. WHEN TO RESIZE
In some cases, the processor system must be reconfigured frequently. The frequent context changes stem not only from switching among applications but also from aggressive phases of the same application. As mentioned earlier, we assume that external factors determine whether the d-cache should be adjusted. Meanwhile, our proposal depends on sufficient time being available to ''prepare'' the caches before the actual reconfiguration in order to minimize the side effects. Figure 2 depicts the two external factors that trigger the cache resizing. First, an operating system (OS) is likely to request new resources to execute its kernel threads and then resume the original thread. Rather than completely suspending the context of the running program, the OS can temporarily reduce the resources used by the running program. In this manner, the running program remains responsive. This simple fact must not be overlooked, as it represents a key aspect of the ρ-VEX core design. Second, a low-ILP phase within the application also leads the core to reduce its resource utilization in order to save core energy. Conversely, in a high-ILP phase, the core has to allocate new resources to adapt to this change. In both cases, the reconfiguration of the core results in a corresponding cache resizing.
The running applications will most likely have different phases with varying ILP. Figure 3 illustrates this and depicts the different phases of the jpeg-djpeg benchmark. We can clearly observe that the phase changes are not instantaneous, but have a clear slope. The start of the slope, as well as the point at which the slope plateaus again at a different ILP level, can be easily detected with simple counters that keep track of the number of utilized datapaths.
In conclusion, both external factors decide the processor mode change, which in turn determines the cache resizing, and both allow for a preparation time before the actual change takes effect. In the remainder of this paper, we refer to this preparation time as the transition time.
B. HOW TO RESIZE
In the case of upsizing (combining multiple ways to form a larger cache), no detrimental effects are expected, as no live data is removed and the additional cache resources will gradually decrease the cache miss rates. The key idea to reduce the cache misses after downsizing is to keep as much live data as possible within the downsized d-cache. To achieve this goal, the first step is to identify the live data. The second step is to decide how to treat the live data.
Before we discuss these two objectives, we introduce the terminology used throughout this paper. For simplicity, we take a mode switch from 8-issue to 4-issue to illustrate the transition mechanism in Figure 4. There are two crucial instants in the graph: t_decision and t_switch. t_decision represents the moment at which the decision is taken to perform the reconfiguration of the processor. t_switch represents the actual moment at which the reconfiguration occurs, i.e., the processor has just switched its mode. The interval from t_decision to t_switch is the transition period proposed in this paper. In Figure 4, the left 4-issue core0 inherits the 8-issue core's properties after the reconfiguration. Hence, way0 and way1 remain enabled for the running program during the transition period, while way2 and way3 enter the transition state. Thus, the right 4-issue core1 can be used by other tasks. Note that when t_decision coincides with t_switch, our strategy reduces to immediate cache resizing, i.e., the traditional cache resizing.
To identify the live data, we can use active methods as outlined by [8] , [9] that require an additional address correlation table. This adds hardware overhead and in turn increases power consumption. In our opinion, it is much easier to perform this identification by exploiting the temporal likelihood of re-accessing data -this information is already kept by the least recently used (LRU) replacement policy. We therefore correlate the most live data with the most recently used data.
Subsequently, we must determine how to deal with the identified live data. In our approach, we distinguish between the enabled and disabled ways of the cache after downsizing. It is therefore important that the 'enabled' ways contain as much live data as possible after downsizing. We achieve this through two simple methods. First, we slightly adapt the existing LRU algorithm so that, before the downsizing, cache misses are handled using only the ways that remain enabled. Second, when a cache access hits in one of the ''disabled'' ways (before deactivation), we transfer the data from that way to an enabled way. It is therefore important that the transition is long enough for our approach to achieve its goals. In Section V, we present the effectiveness of our approach with varying transition times.
C. ANALYTICAL MODEL
To present the benefit of our approach, we use the following analytical model. Figure 5 depicts the difference in miss rate between two types of cache resizing methods, i.e., our proposal and immediate cache resizing. The x-axis represents the observation period after the instant t_switch. The y-axis represents the cache miss rate of our approach normalized to that of immediate cache resizing.
Because the active data is migrated during the transition period, a large amount of active data is maintained in the enabled part of the cache. Hence, the cache miss rate does not rise abruptly immediately after the capacity is reduced, but slowly approaches the curve (y = 1) and eventually coincides with it. If there is no transition mechanism, the active data in the disabled ways is lost, leading to a sharp rise in the cache miss rate that jumps directly to y = 1. The area of the blue shadows in Figure 5 is the exact benefit of our approach. Therefore, one of our primary concerns is to make this area as large as possible after the cache resizing occurs. The other concern is how long this lower cache miss rate can be sustained. This period is called the lasting effect time.
$$B(T) = \int_{t_{switch}}^{T} \bigl( f(t) - f'(t) \bigr) \, dt \tag{1}$$
A mathematical description of the lasting effect B(T) is given in Equation 1. B(T) is a function of the time frame T. f(t) is the miss rate over time under immediate cache resizing, while f'(t) is the miss rate over time under the transition time approach. Therefore, the benefit of our proposal is the integral of the difference between f(t) and f'(t) from the actual resizing time t_switch to the time T.
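As a minimal sketch of how the benefit described above could be evaluated numerically, the following Python snippet approximates the area between the two miss-rate curves with the trapezoidal rule. The function name `benefit` and all sample values are hypothetical and chosen only for illustration; they are not measurements from our experiments.

```python
def benefit(times, f_immediate, f_transition):
    """Trapezoidal estimate of the area between the two miss-rate curves.

    times        : sample instants (cycles after t_switch), increasing
    f_immediate  : miss rate under immediate resizing at each instant
    f_transition : miss rate under the transition approach at each instant
    """
    total = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        gap_prev = f_immediate[i - 1] - f_transition[i - 1]
        gap_curr = f_immediate[i] - f_transition[i]
        total += 0.5 * (gap_prev + gap_curr) * dt
    return total


# Hypothetical normalized miss rates, sampled every 2000 cycles after t_switch.
times = [0, 2000, 4000, 6000, 8000]
f_immediate = [1.0, 1.0, 1.0, 1.0, 1.0]       # normalized baseline (y = 1)
f_transition = [0.5, 0.75, 0.875, 1.0, 1.0]   # converges to y = 1
b = benefit(times, f_immediate, f_transition)
```

In this made-up example the two curves coincide from 6000 cycles onwards, so extending T beyond that point no longer increases the benefit; this corresponds to the end of the lasting effect time.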
III. FRAMEWORK OF CACHE RECONFIGURATION SYSTEM
The key challenge in our design is to apply a mechanism to migrate the live data into the enabled part of cache. In this section, we discuss the implementation of the framework and the novel replacement policy on the reconfigurable d-cache.
A. HARDWARE IMPLEMENTATION
In the design of the ρ-VEX processor, the reconfigurable cache relies on way-associativeness. A key advantage of selective-ways is that dividing the cache is more efficient. In particular, the selective-way method [10] is able to adapt to dynamic issue-width. The selective-way circuit does not appear in critical paths in the cache so that the tags and the data decoders can be operated in parallel.
The basic cache organization studied in this paper is derived from conventional SRAM-tag caches, in which the tag and data arrays are organized as SRAM structures. We added a way mask register, modified the tag bits, and added a simple combinatorial circuit to enable/disable the appropriate cache arrays. These are small changes compared to the current set-associative cache organization.
As depicted in Figure 6, the way-selection logic (WSL) forwards the active/inactive way-enable signal to the corresponding way. The way-mask register is a 4-bit vector register with one bit per way, i.e., its width equals the associativity of the cache. It specifies which ways are allocated to which core. Hence, each bit of the way-mask register controls the corresponding cache way.
For a core with an issue-width of 8, the value of the way-mask register is 1111, which denotes that all of the ways are available. In this case, the d-cache acts as a four-way set-associative cache.
If the core's issue-width is 4, the available values of the way-mask register are 1100 or 0011, which denote the left ways or the right ways of the way-associative cache, respectively. In this case, the d-cache acts as a two-way set-associative cache.
When the issue-width is 2, there are four available values (0001, 0010, 0100, and 1000), since the four 2-issue cores can run independently. The d-cache then acts as a direct-mapped cache.
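The way-mask encoding above can be sketched in a few lines of Python. This is only an illustrative software model under our own naming (`WAY_MASKS`, `associativity`, `is_valid`, and `disjoint` are hypothetical helpers, not part of the hardware description); the mask strings are written in the same bit order as in the text.

```python
# Way-mask values listed in the text, keyed by issue width.
WAY_MASKS = {
    8: ["1111"],
    4: ["1100", "0011"],
    2: ["0001", "0010", "0100", "1000"],
}


def associativity(mask: str) -> int:
    """Number of enabled ways, i.e., the effective associativity."""
    return mask.count("1")


def is_valid(issue_width: int, mask: str) -> bool:
    """True if the mask is one of the values allowed for this issue width."""
    return mask in WAY_MASKS.get(issue_width, [])


def disjoint(masks) -> bool:
    """True if no way is assigned to more than one core."""
    combined = 0
    for m in masks:
        bits = int(m, 2)
        if combined & bits:
            return False
        combined |= bits
    return True
```

For example, one 4-issue core with mask 1100 and two 2-issue cores with masks 0010 and 0001 pass the `disjoint` check, whereas the masks 1100 and 0100 would assign the same way to two cores.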
B. RESIZING-LRU REPLACEMENT POLICY
Focusing on the dynamic mode changes of the processor, we have modified and improved the traditional LRU replacement algorithm; the result is called resizing-LRU (R-LRU) and is shown in Algorithm 1. The R-LRU replacement policy has to satisfy the cache resizing requirements within the transition period. The key idea of the novel cache algorithm is to migrate the cached data blocks to the enabled cache part within the transition time, taking into account the inherent temporal locality of the workloads. In this subsection, we divide the R-LRU replacement policy into three cases and discuss each of them in turn.
• Case 1: hit in the disabled part.
• Case 2: hit in the enabled part.
• Case 3: miss in the whole d-cache.
Subsequently, we use the LRU-based stack depicted in Figure 7 to discuss the advantages of R-LRU. The figure shows six different LRU stacks that can occur during the transition period. On the left side of the figure, the MRU position represents the most recently used block and the LRU position represents the least recently used block. Thus, Position 1 is more recent than Position 2. A blue block denotes a cache block in the T (transition) state that will switch to the D (disabled) state after the transition period. On the right side of the figure, a three-column table shows the 18 cache access scenarios corresponding to the above-mentioned LRU stacks.
• Case 1 (hit in the disabled part). A hit in the disabled part is treated like a miss in the enabled part. The last enabled block of the LRU list is evicted and we replicate the hit data from the to-be-disabled part to this position. Although the capacity of the set that was hit probably decreases, R-LRU only replicates the hit data instead of accessing the lower-level cache. Therefore, the cost of this case is lower than that of an actual miss. Furthermore, in three scenarios the LRU block is evicted and in only one scenario the block at the secondary position of the LRU list is evicted, as shown in the second column of the table. A cached block exhibits temporal locality, i.e., it is likely to be accessed once more in the near future. In this manner, R-LRU increases the amount of most recently used data in the enabled part, which shows a benefit of the R-LRU replacement policy.
• Case 2 (hit in the enabled part). For a hit in this part, the R-LRU replacement policy simply moves the hit block to the head of the list, thereby updating the LRU list. This update is the same as in the traditional LRU replacement policy. The more hits occur in the enabled part, the better the temporal and spatial locality is preserved in this part, which makes it easier to maintain the performance after downsizing. There is no additional overhead for deploying R-LRU in this case. As a result, R-LRU maintains the cache performance, as shown in the first column of the table.
• Case 3 (miss in the whole d-cache). On a cache miss, the referenced block is brought into the enabled part only. R-LRU finds the last enabled block in the LRU list and evicts it. In this case, there are three scenarios in which the evicted block is in the LRU position, which is the same as in the conventional LRU replacement policy. In the remaining three scenarios, one eviction occurs at Position 1 and two evictions occur at Position 2. Considering that the LRU block is about to be evicted anyway, R-LRU slightly adjusts the sequence of the LRU list and brings the new block into the enabled way in advance.
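The three cases above can be sketched as a small software model of a single cache set during the transition period. This is a simplified illustration, not the hardware implementation: the class name `RLRUSet` and its methods are hypothetical, the lower-level cache and write-backs are not modeled, and invalidating the source copy after a Case 1 hit is our assumption for the sketch.

```python
class RLRUSet:
    """One cache set during the transition period (illustrative model only)."""

    def __init__(self, blocks, enabled_ways):
        # blocks: (tag, way) pairs ordered from MRU (index 0) to LRU (last).
        self.stack = [list(b) for b in blocks]
        self.enabled = set(enabled_ways)  # ways that stay enabled after downsizing

    def _find(self, tag):
        for i, (t, _way) in enumerate(self.stack):
            if t is not None and t == tag:
                return i
        return None

    def _last_enabled(self):
        # Index of the enabled entry closest to the LRU end of the list.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][1] in self.enabled:
                return i
        raise ValueError("at least one way must stay enabled")

    def access(self, tag):
        i = self._find(tag)
        if i is not None and self.stack[i][1] in self.enabled:
            # Case 2: hit in the enabled part, plain LRU update.
            self.stack.insert(0, self.stack.pop(i))
            return "hit-enabled"
        if i is not None:
            # Case 1: hit in a to-be-disabled way. The data is copied into an
            # enabled way below; invalidating the source copy is an assumption.
            self.stack[i][0] = None
            outcome = "hit-disabled"
        else:
            # Case 3: miss; the block would be fetched from the next level.
            outcome = "miss"
        victim = self.stack.pop(self._last_enabled())
        victim[0] = tag               # reuse the victim's (enabled) way
        self.stack.insert(0, victim)  # the new block becomes MRU
        return outcome

    def enabled_tags(self):
        return [t for t, way in self.stack if way in self.enabled and t is not None]


# Hypothetical 4-way set, MRU first; ways 0 and 1 stay enabled (8-issue to 4-issue).
s = RLRUSet([("A", 2), ("B", 0), ("C", 3), ("D", 1)], enabled_ways={0, 1})
outcomes = [s.access(t) for t in ("A", "B", "E")]
```

In this example, the access to A hits in a to-be-disabled way and is migrated into way 1 (evicting D), the access to B is a regular hit, and the miss on E is filled into an enabled way only, so the to-be-disabled ways receive no new data.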
In this way, we benefit from the R-LRU replacement policy if the LRU block is not accessed again before being evicted. In summary, during the transition period the R-LRU policy either transfers the accessed data from the to-be-disabled ways of the cache to the enabled ways or boosts the liveness of the data in the enabled ways. Ideally, all the enabled blocks will reside in the first N positions of the LRU list when the transition period ends. In other words, all recently used blocks are located at the head of the LRU list. As t_switch approaches, our approach minimizes the cache miss penalty introduced by cache reconfiguration.
The interval of transition impacts the performance directly, which is a key factor in the framework. The parameter controls when to start the transition state. A larger value means a longer transition period, causing data to be migrated earlier, which may pollute the cache. On the other hand, smaller values may lead to lost opportunities to move live data and, in turn, weaken the migration mechanism. In fact, our predictor is capable of foreseeing the switch of the issue-width mode several thousand cycles ahead. Therefore, it creates an opportunity to exploit the large configuration space of the parameter interval. In Section V-A, we present the impact on the performance in detail.
Algorithm 2 Cache Resizing Scheme Framework

IV. EVALUATION
In this section, we present our experimental environment and benchmarks employed in our evaluation. Subsequently, the impact on cache misses and the lasting effects are presented in Section V.
A. BENCHMARKS
The MiBench benchmark suite [7] is used to evaluate the cache performance. It contains a collection of embedded and portable applications. The benchmarks were compiled with Hewlett-Packard's vex-3.43 compiler [11]. Table 1 presents an overview of the benchmarks, in which we list the number of operations, the number of bundles, the number of memory accesses (load and store instructions) (in %), the average number of bundles per phase, and the number of phases for each application of the MiBench benchmark suite.
B. METHODOLOGY
The baseline configuration of the 8-issue ρ-VEX is presented in Table 2. As explained above, the largest ρ-VEX core using the 8-issue mode has a four-way set-associative d-cache. In the 4-issue mode, the whole cache is divided between the two cores, so each core has half the cache size (32 Kbytes, 2-way). Likewise, in the 2-issue mode, the whole cache is divided among the four cores and each core has a 16 Kbyte direct-mapped cache. Our experimental platform mainly consists of the following elements:
• Cache simulator: We extended the DineroIV [12] cache simulator, which is a sequential trace-driven cache simulator, to be able to simulate the reconfigurable cache as presented in Section II.
• (Memory) trace generator: We used a hardware trace unit to generate instruction traces for each benchmark on the FPGA prototype of ρ-VEX. We extracted the memory read and write operations from these traces for use as input to the cache simulator.
• Core phase predictor: We implemented a simple phase predictor to measure the ILP of the benchmark traces and predict/decide the most suitable mode for the ρ-VEX core to execute in, e.g., low-ILP in 2-issue mode and high-ILP in 8-issue mode. In addition, this predictor takes into account the trade-offs in terms of delay, energy consumption, and the energy delay product (EDP) to make the phase predictions [2]. These predictions are used to make the mode decisions, which are fed to the cache simulator and lead to cache resizing. As explained in Section II-A, the predictor also outputs a signal to indicate when a resizing decision is pending.
• ρ-VEX prototype: We use an FPGA to prototype the ρ-VEX and run applications on actual hardware. The design runs on a Virtex 6 (ML605 development board) at 37.5 MHz. We use GRLIB [13] to interface with main memory and other peripherals. The target for the generic binaries is an 8-issue machine with 8 ALUs, 4 multipliers, 1 load/store unit and 1 branch unit. The layout of the functional units is shown in Table 2. The layouts for the 4-issue and 2-issue versions that support the generic binary are also listed in Table 2.
The platform we built supports two kinds of dynamic cache reconfiguration, i.e., cache upsizing and cache downsizing. When cache upsizing happens, i.e., the cache memory system increases its associativity, the data arrays and tags retain the previously cached data, so the performance does not degrade. Hence, this paper mainly evaluates the scenario of d-cache downsizing. Upon the trigger signal of the mode switch, we perform the d-cache downsizing within the proposed framework, which combines the R-LRU replacement policy with various transition periods. For a fair comparison, we also simulate each benchmark with the immediate cache resizing method (a transition interval of zero cycles) at the same points. As mentioned before, the ρ-VEX processor design supports three issue-width modes, namely 8-issue, 4-issue and 2-issue. Correspondingly, there are three reconfiguration scenarios of d-cache downsizing, i.e., (1) from a four-way set-associative d-cache to a two-way set-associative d-cache, (2) from a two-way set-associative d-cache to a direct-mapped d-cache, and (3) from a four-way set-associative d-cache to a direct-mapped d-cache. Finally, the ρ-VEX processor is always initialized to the 8-issue mode. After skipping the initialization and fully warming up the cache, we perform several different measurements.
V. RESULTS
In this section, to illustrate the effectiveness of our framework in minimizing the impact of cache downsizing, we explore several transition intervals and measure how much the miss rates can be reduced. Finally, the lasting effect and the total decline of cache misses are discussed. Table 3 lists the number of cache reconfigurations occurring in each benchmark. Figure 8 depicts how the interval of transition impacts the cache performance.
A. IMPACT OF THE INTERVAL OF TRANSITION
The y-axis of the three sub-graphs indicates the decline in cache misses after the actual cache downsizing compared to a transition period of zero cycles (i.e., immediate cache downsizing). We evaluate various transition intervals ranging from 10 to 10000 (bundle) cycles (see x-axis), and then count the cache misses within 2000 cycles after the actual downsizing occurs. All the results are normalized to immediate cache reconfiguration at the same execution point. The statistics are classified into three kinds of cache downsizing, i.e., mode switches from 4-issue to 2-issue, from 8-issue to 2-issue, and from 8-issue to 4-issue.
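The normalization described above can be illustrated with a short Python sketch. It assumes a hypothetical trace format of (cycle, is_miss) pairs; the function names and the sample counts are ours and serve only to make the computation concrete.

```python
def misses_in_window(events, t_switch, window=2000):
    """Count misses in the `window` cycles that follow the downsizing instant.

    events : iterable of (cycle, is_miss) pairs from a (hypothetical) trace
    """
    return sum(
        1 for cycle, is_miss in events
        if is_miss and t_switch <= cycle < t_switch + window
    )


def miss_reduction(misses_transition, misses_immediate):
    """Relative decline in misses versus immediate resizing (0.25 means 25%)."""
    if misses_immediate == 0:
        return 0.0
    return 1.0 - misses_transition / misses_immediate


# Hypothetical counts at the same execution point for both methods.
reduction = miss_reduction(misses_transition=75, misses_immediate=100)
```

A positive value means fewer misses than immediate downsizing, while a negative value corresponds to the cases in which the transition approach increases the number of misses.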
Figure 8 demonstrates that our proposed framework is able to decrease cache misses in all three cache downsizing scenarios, which is a clear advantage over the immediate cache reconfiguration approach. The longer the transition period, the larger the reduction in cache miss rate. Given a transition period of 2000 cycles, the results of most benchmarks converge towards near-optimal performance. More specifically, as shown in Figure 8, we achieve an average decline in cache misses of 13% for 8-issue to 4-issue, 26% for 8-issue to 2-issue, and 16% for 4-issue to 2-issue. For the switch scenario from 8-issue to 4-issue, 16 of the 20 benchmarks show cache miss rates that decrease continuously and rapidly as the transition period is extended. The same holds for 12 benchmarks in the switch scenario from 4-issue to 2-issue and for 11 benchmarks in the switch scenario from 8-issue to 2-issue. In addition, the blowfish, FFT, CRC32, gsm-untoast, bitcount and jpeg-djpeg benchmarks are very sensitive to the interval of transition. The decrease in cache misses is as high as 46% for blowfish in the 8-issue to 4-issue case, and 63% for CRC32 and 60% for bitcount in the 8-issue to 2-issue case.
In Figure 8 , we still notice that for blowfish in the 8-issue to 2-issue switch and for tiffmedian in the 4-issue to 2-issue switch, the number of misses slightly increases in our framework compared to the immediate downsizing approach due to the reason that the majority of memory references are mapped in the same set. For such benchmarks, our method possibly pollutes the d-cache replacing other useful data, eliminating the performance benefits. Figure 9 shows given a transition period of 2000 (bundle) cycles, the cumulative lasting impact of the MiBench benchmarks on one complete execution in which includes all kinds of the mode changes. All the results are normalized to the immediate cache resizing, i.e., the transition period is set zero cycle. The black curve represents the average level of all benchmarks.
B. LASTING EFFECT
The cache miss curve gradually approaches y = 1 (red line) instead of jumping directly to it. Consequently, our approach keeps cache misses low after the cache reconfiguration. The area between the black curve and the red line is the benefit we achieve. In particular, for the blowfish, FFT, basicmath, and jpeg-djpeg benchmarks, the decline in cache miss rate ranges from 13% to 37%, and for these benchmarks the beneficial effect of our approach lasts until 7500 cycles after the actual downsizing event.
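The normalization and the "area" reading of the benefit can be illustrated with a small sketch. It is not the measurement code used for Figure 9: the function names and the sample counts are hypothetical, and the sketch assumes that misses are counted per fixed-length sampling window for both the immediate-resizing baseline and the run with a transition period.

```python
def normalized_miss_curve(misses, baseline_misses):
    """Cumulative misses divided by cumulative baseline misses, per window.

    A value of 1.0 means "no better than immediate resizing".
    """
    curve = []
    total = 0
    base_total = 0
    for m, b in zip(misses, baseline_misses):
        total += m
        base_total += b
        # Avoid division by zero while the baseline has not missed yet.
        curve.append(total / base_total if base_total else 1.0)
    return curve


def benefit_area(curve, window_cycles):
    """Area between the line y = 1 and the curve (rectangle rule)."""
    return sum((1.0 - y) * window_cycles for y in curve)


# Hypothetical per-window miss counts after a downsizing event.
baseline = [10, 10, 10, 10]   # immediate resizing
ours = [5, 8, 10, 10]         # with a transition period

curve = normalized_miss_curve(ours, baseline)
area = benefit_area(curve, window_cycles=500)
```

In this toy input the curve starts at 0.5 and climbs towards 1 as the per-window miss counts of both runs become equal, which mirrors the gradual approach to the red line described above.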
Our approach reduces cache misses across all the MiBench benchmarks by, on average, 16% (2000 cycles after cache reconfiguration), 14% (4000 cycles after cache reconfiguration), and 9% (7500 cycles after cache reconfiguration). Based on our experiments, as long as the transition interval is set to 2000 (bundle) cycles, our proposal can improve cache performance for more than 6000 (bundle) cycles after a cache reconfiguration. This framework is especially useful when switching from a 4-way to a 2-way set-associative cache.
With the R-LRU cache replacement policy running during the transition period, the downsizing moments do not result in a sharp increase in cache misses (y = 1 in Figure 9). For cache upsizing, because the newly enabled cache resources must first be repopulated, the cache miss rate does not recover immediately. Yet, our framework smooths the cache misses, and when an upsizing event occurs before the (lasting) effect of our method completely subsides, the cache miss rate can rise again from a much lower point. For instance, the cache recovery can start at any point on the cache miss curve. In this way, our framework can greatly reduce the cache misses in a dynamic environment where the cache size must be adjusted quickly. Our measurements further indicate that a transition period of approximately 1000-3000 cycles is sufficient to obtain the main benefit of our approach.
C. FURTHER INVESTIGATION
In the previous sections, we presented the results of our approach in maintaining as much live data as possible in the ''enabled'' cache part after downsizing. Figure 10 presents the percentage of cumulative hit and miss counts within the utilized caches, distinguishing whether a way was enabled or disabled after downsizing. In Figure 10, each left bar summarizes the cache behavior with R-LRU turned off, i.e., immediate cache resizing without the R-LRU method. Each right bar shows the behavior with the R-LRU method. We can clearly observe from each right bar of Figure 10 that the enabled way maintains a high hit rate.
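The per-way breakdown can be sketched as follows. This is only an illustration under the assumption that the reported percentage is the share of all hits that fall in ways remaining enabled after downsizing; the function name and the counts are hypothetical.

```python
def enabled_way_hit_share(hits_per_way, enabled_ways):
    """Fraction of all hits that land in ways that stay enabled."""
    total = sum(hits_per_way)
    if total == 0:
        return 0.0
    enabled_hits = sum(hits_per_way[w] for w in enabled_ways)
    return enabled_hits / total


# Hypothetical 2-way cache: way 0 stays enabled, way 1 is switched off.
share = enabled_way_hit_share(hits_per_way=[30, 10], enabled_ways=[0])
```

With hits spread evenly over both ways, the same function returns 0.5, which is the distribution one would expect when no way is preferred.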
After a more careful investigation, we noticed a second contributing factor to the results depicted by each right bar in Figure 10, namely that we always choose the same way to remain enabled when downsizing the cache.
For that reason, we reran our experiments with our R-LRU approach turned off. The results are depicted by the left bars in Figure 10. This illustrates that our approach is capable of increasing an already high hit rate from 81% to 98% on average. However, a hit rate of 81% was somewhat unexpected, as one would expect a roughly 50%-50% distribution in a set-associative cache. For this reason, we decided to minimize the effect of ''always preferring the same cache way'' when downsizing by alternating the choice of the cache way that remains enabled. The results are presented as the left bars in Figure 11 (again, with R-LRU disabled).
As expected, each left bar in Figure 11 shows that the hit rate in the enabled way is closer to 50%. For our last experiment, we alternated the choice of the ''enabled'' way and turned on our R-LRU approach once again. The experimental results are depicted as the right bars in Figure 11. We can clearly observe that our R-LRU approach is capable of increasing the hit rate from about 50% to 90%, compared to an increase from 81% to 98% when a cache way is preferred. We note here that our approach is versatile in that it allows arbitrary choices of the ''enabled'' cache way. Finally, we again measured the transition time that our approach (without way preference) requires to achieve the 98% level, as presented in Figure 11, and established this to be only about 4000 cycles. In summary, the quantitative analysis above shows that our migration approach is cost-effective.
VI. RELATED WORK
Reconfigurable caches offer the flexibility to change one or more of their design parameters over time. There have been various research efforts on cache resizing techniques. Most of these works are cache tuning approaches [14], [15], which address the problem from two aspects: when to resize a cache and how to resize it. We note here that these approaches perform cache resizing in order to adapt the cache for long-term benefits regarding cache utilization, low power/energy consumption, performance, etc. In contrast, our approach reacts to cache resizing events to minimize the local (in time) performance impact inflicted by increased cache misses. Therefore, it is orthogonal to the cache reconfiguration approaches presented here.
First, previous research on ''when to resize'' tactics mostly depended on monitoring the cache hit/miss ratio or on program profiling to determine the proper time to adjust the cache. The approaches in [9], [16], [17] decide to resize the cache solely according to the cache hit/miss rate. Such conventional approaches are called hit/miss-driven methods. In [18], dynamic profiling analysis is used to predict the cache usage and energy efficiency of applications with multi-cache configurations. Reference [19] defines a thread's active cache footprint (ACF) over a period to guide cache resizing.
However, cache misses are not always a good performance indicator [16]. Several factors may affect the cache hit/miss ratio, and even tiny changes in program behavior or available cache space may lead to sharp changes in miss ratio [15], [20]. These hit/miss-driven approaches are therefore likely to thrash the cache. Likewise, the profiling methods, like the hit/miss-driven approaches, raise the hardware/software overhead. Unlike previous works, as far as we know, our paper is the first to introduce two external events to activate dynamic cache reconfiguration once the issue-width mode of the VLIW processor switches. At the same time, while reducing the cache size, our method takes the reduction of the cache miss rate into account. As a result, our work alleviates the cache miss penalty so that the performance of the cache is improved.
Second, previous studies on the strategy of ''how to resize'' vary. Some works reconfigure the number of sets or ways, or the size of cache lines, to reach a certain size at different granularities. Based on these granularities, cache resizing techniques can be classified as way-resizing [10], [21]-[26], set-resizing [17], [27], and hybrid (set-and-way resizing) [16]-[18]. Reference [16] proposed a method to smooth the performance of reconfigurable caches, which is closely related to our work. Their work keeps the part of the cache that is being closed active and lets it decay gradually rather than immediately shutting down the disabled part of the cache. However, the closing part of the cache may continue to produce cache hits for a long period, even until the program completes. This leads to an extremely long transition period for cache resizing.
Our work, however, differs significantly from [16] in the following ways. First, our method's transition time starts at the decision time of cache resizing and ends at the actual switch time. Hence, caches are required to reconfigure and to free the disabled portion of the cache within a finite period of time after they receive the reconfiguration decision. Second, we propose a novel extension of the cache replacement strategy that proactively migrates the live blocks to the enabled portion of the cache ahead of the actual switch time. Third, our approach achieves more benefit under frequent mode changes between applications and between aggressive phases of the same application. We also evaluate the cache performance benefits of the replacement policy and thereby verify that our cache reconfiguration framework smooths performance after cache resizing. In [28], the authors observed that a higher demand for cache resources does not invariably correlate with a benefit from additional cache resources, which means that a cache access that hits in an N-way set-associative cache also hits in a cache with more than N ways. In [28], the authors' aim is to partition the cache in accordance with this phenomenon, whereas we use this information to distinguish the high-utility cache ways from the low-utility cache ways.
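The idea of proactive migration during a bounded transition period can be illustrated with a toy model of a single cache set. This is a simplified sketch and not the R-LRU hardware: the class `CacheSet`, its methods, the swap-on-hit rule, and the timestamp-based LRU are assumptions made for illustration, and write-back of dirty data is ignored.

```python
class CacheSet:
    """Toy model of one set of a set-associative cache with LRU."""

    def __init__(self, num_ways):
        self.tags = [None] * num_ways
        self.last_used = [-1] * num_ways  # timestamp of last access per way
        self.clock = 0

    def _lru(self, ways):
        # Least recently used way among the given candidates.
        return min(ways, key=lambda w: self.last_used[w])

    def access(self, tag, active, keep=None):
        """Return True on a hit.

        active: ways that are currently powered.
        keep:   ways that stay enabled after a pending downsizing,
                or None when no transition is in progress.
        """
        self.clock += 1
        way = next((w for w in active if self.tags[w] == tag), None)
        if way is not None:
            if keep is not None and way not in keep:
                # Hit in a to-be-disabled way: swap the live block with
                # the LRU block of the ways that remain enabled.
                victim = self._lru(keep)
                self.tags[way], self.tags[victim] = self.tags[victim], self.tags[way]
                self.last_used[way], self.last_used[victim] = (
                    self.last_used[victim], self.last_used[way])
                way = victim
            self.last_used[way] = self.clock
            return True
        # Miss: during a transition, fill only ways that remain enabled.
        victim = self._lru(keep if keep is not None else active)
        self.tags[victim] = tag
        self.last_used[victim] = self.clock
        return False

    def downsize(self, keep):
        # Actual switch: invalidate every way that is not kept.
        for w in range(len(self.tags)):
            if w not in keep:
                self.tags[w] = None
                self.last_used[w] = -1


def run(with_transition):
    s = CacheSet(2)
    s.access("A", [0, 1])
    s.access("B", [0, 1])               # B is the most recently used block
    if with_transition:
        s.access("B", [0, 1], keep=[0])  # B migrates into way 0
    s.downsize(keep=[0])
    return s.access("B", [0])            # is B still present afterwards?
```

In this toy scenario, `run(True)` reports a hit on block B after downsizing, whereas `run(False)`, which models immediate resizing, loses B together with the disabled way.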
VII. CONCLUSIONS
The issue-width mode changes of the ρ-VEX processor allow for much better core resource utilization, but they require an adaptive cache to follow suit. The mode changes can also be triggered by an operating system when multiple processes/threads need to be supported. The added benefit of using the ρ-VEX processor is that threads do not need to be switched out to allow other threads to run. We can simply reduce the resources of the running thread to free enough resources for the second thread to execute. In such a scenario, changing the issue-width mode also translates to a downsizing of the set-associative cache attached to the executing core. In this way, we eliminate the need to move data from one cache to another. However, the downsizing results in the deletion of useful live data. The work described in this paper aims to reduce the miss rate penalty associated with the downsizing event by increasing, prior to the downsizing event, the amount of recently used data within the part of the cache that remains enabled. We proposed an adaptive reconfigurable d-cache combined with an R-LRU cache replacement strategy with low hardware overhead. Our results demonstrate that our approach can maintain a low cache miss rate with a transition period of at most 6000 cycles, while a 2000-cycle transition period is sufficient for good results. Also, our proposal reduces the sharp increase in cache misses caused by cache shrinkage by between 10% and 63%.
Finally, the cache performance improvements will be even larger when the issue-width mode changes frequently.
Of course, Amdahl's Law dictates that if these downsizing events are only a small part of the total application execution time, the effect is minimal. Our results cover only short periods after a reconfiguration occurs. However, a computing system may frequently perform core reconfiguration (and therefore also cache reconfiguration) to maintain a high level of responsiveness while retaining the majority of its performance. Our in-depth investigation of the cache hit rates of the enabled part of the cache after downsizing shows that we can improve this rate from 56% to 90% or even 98% when considering a transition period of only 2000 or 4000 cycles, respectively.
VIII. FUTURE WORK
To further improve the reconfigurable cache performance, we propose a feedback mechanism, i.e., a coordinated effort in which information from the caches drives the mode changes. This proposal will ensure that more benefits can be achieved from dynamic and frequent reconfiguration. In addition, this design will eliminate low-yield reconfigurations, not only for cores but also for caches.
