Abstract: While performance and power continue to be important metrics for embedded systems, as CMOS technologies continue to shrink, new metrics such as variability and reliability have emerged as limiting factors in the design of modern embedded systems. In particular, the reliability impact of pMOS negative bias temperature instability (NBTI) has become a serious concern. Recent works have shown how conventional leakage optimization techniques can help mitigate NBTI-induced aging effects on cache memories. In this paper we focus specifically on scratchpad memory (SPM) and present novel software approaches as a means of alleviating the NBTI-induced aging effects. In particular, we demonstrate how intelligent software-directed data allocation strategies can extend the lifetime of partitioned SPMs by means of distributing the idleness across the memory sub-banks.
I. INTRODUCTION AND BACKGROUND
Memory subsystems have long been known to be a critical component in the overall performance for embedded platforms. However, more recently, these memory systems have also been demonstrated to be the major contributor to the total power budget. The problem of power dissipation in memories is exacerbated in ultra-deep sub-micron CMOS technologies, where static power due to leakage currents (and sub-threshold current in particular) is coupled with high dynamic power dissipation [1] .
Leakage power is especially critical for memories, where the high density of integration translates into high power density. The latter is the main source of heat generated across the substrate, which, if not quickly removed through packaging and/or cooling architectures, may induce drastic increases in temperature. Working at high on-chip temperatures affects both performance (MOS transistors and global wires get slower at high temperatures [2], [3]) and static power consumption (leakage current increases exponentially with temperature [1]). For these reasons, several leakage-aware memory solutions, ranging from software and system-level hierarchy optimization to architecture and circuit-level structures, have been proposed over the past few years (e.g., [4], [5]).
While performance and power continue to be important metrics for embedded systems, as CMOS technologies continue to shrink down below 65nm, new metrics such as variability and reliability have emerged as limiting factors in the design of modern embedded systems. This is especially true when considering memory architectures; access to/from memory is involved in every executed instruction (e.g., data and instruction fetch), and once the memory system becomes unreliable, the reliability is compromised for the system as a whole.
Although process variation is the most evident source of variability, and thus, unreliability [6] , [7] , time-dependent deviations in the operating characteristics of nanoscale MOS devices [8] have recently been shown to be one of the more critical issues in determining the lifetime of CMOS circuits. In particular, the reliability impact of pMOS negative bias temperature instability (NBTI) has become a serious concern [9] .
In CMOS logic, NBTI effects occur on pMOS transistors when they are negatively biased (i.e., a logic 0 is applied to the gate of the pMOS, resulting in $V_{gs} = -V_{DD}$), and manifest themselves as an increase in the threshold voltage $V_{th}$ over time [9]. Such a $V_{th}$ variation has a direct impact on the long-term stability of traditional 6T-SRAM cells, whose signal noise margin (SNM) degrades over time, thus altering the capability of a cell to reliably store a correct logic value [10]-[12]. Experimental data report a variation of $V_{th}$ of about 10-15% per year, which translates into more than a 10% SNM degradation (after 3 years) depending on the target technology and the operating conditions [11].
The push to embed reliable and low-power memory architectures into modern systems-on-chip is driving the EDA community to develop new design techniques and circuit solutions that can address these critical issues. Very recent works [13], [14] have shown how conventional leakage optimization techniques [4], [5] can offer a valuable solution to mitigate NBTI-induced aging effects on memories while still obtaining reductions in power dissipation. More specifically, they have quantified the beneficial effects on aging of popular power management schemes such as power gating and dynamic voltage scaling (DVS). Power gating, when implemented using sleep transistors, has the effect of completely nullifying the aging effects [15]. Similarly, but with a smaller impact, voltage scaling reduces NBTI-induced aging because a reduced $V_{DD}$ corresponds to a smaller bias voltage. As reported in [13], the reduction in SNM as a function of $V_{DD}$ is roughly linear; the degradation of SNM under a "drowsy" voltage $V_{DD,drowsy}$ is about 60% of the degradation at the nominal $V_{DD}$.
Although the majority of these works focus on cache memories, there exist other types of SRAM memory structures whose efficiency also impacts the total power-performance tradeoff, as well as the lifetime of the embedded system. Among them, scratch-pad memory (SPM) is the most significant example. SPMs are widely used in embedded systems for image and video processing applications that make heavy use of multi-dimensional arrays of signals and nested loops. For these classes of applications, the flexibility of caches in terms of workload adaptability is often not needed; instead, performance predictability, power consumption, and implementation costs play a much more critical role.
Even though caches and SPMs perform the same function (namely, temporarily holding small chunks of data for high-speed, frequent retrievals), their implementation and management are completely different. The main differentiating factor is that, while caches guarantee full transparency at the cost of more hardware resources and non-deterministic latency, SPMs, which do not need any caching logic, are faster, more predictable, and less power consuming. Note that for SPMs it is the designer, not the hardware, who decides the mapping of addresses to locations in the SPM.
This new degree of freedom in the design space opens a completely unexplored area in the field of concurrent power-aging memory optimization. Namely, it allows for the use of innovative software approaches as a means of alleviating the NBTI-induced aging effects. In contrast to previous works that focus on pure circuit/architectural cache solutions (e.g., [14], [16]), in this paper we investigate new software-controlled, NBTI-aware data management solutions for low-power SPMs.
Building on previous findings that SRAM cells in a low-leakage state (i.e., idle state) are less affected by NBTI stress [13], we demonstrate how intelligent data allocation strategies can extend the lifetime of partitioned SPMs by means of distributing the idleness across the memory sub-banks. The basic reasoning behind this approach is that, while from a leakage viewpoint it is the total number of idle blocks that matters, for aging it is the distribution of idleness across the memory banks that can help maximize the lifetime of the entire SPM, and thus of the system as a whole.
To support our claims we developed a dedicated library of C functions that implement NBTI-aware data allocation through a dedicated malloc, called SPM malloc. This function is aware of the current aging of each memory bank and maps the heap of each task such that all the banks age at the same rate. More specifically, the data are allocated such that all the banks can spend the same amount of time in the idle state (low-$V_{DD}$ state). Dedicated lookup tables containing the pre-characterized NBTI-induced SNM degradation of a standard 6T-SRAM cell mapped onto an industrial 45nm technology are used to estimate the aging of the SPM's sub-banks. We demonstrate the effectiveness of our approach through a motivational example and show that the overall idle time can be more evenly distributed across all banks of the SPM, thereby preventing some banks from aging faster than others and increasing the reliability of the memory system overall.
II. ARCHITECTURAL OVERVIEW
Our target architecture is shown in Figure 1. The baseline configuration consists of an ARM7 CPU coupled with an L1 cache and a block of scratch-pad memory. A Direct Memory Access (DMA) engine is also present, and it is in charge of accelerating the data movements between the on-chip and off-chip memories. This type of architecture has many applications (e.g., smartphones, cameras, game consoles) and is widely adopted in the embedded domain. In our simulations we exclude the DMA in order to observe the behavior of a system that does not use one. This also allows us to estimate the worst-case performance scenario; adding a DMA in the future should improve on it. Next, we describe the proposed hardware and software extensions.
A. Hardware Extensions
Similarly to [17], we propose a multi-banked scratch-pad memory with independently powered banks. This kind of structure has also been studied in caches (e.g., [18]), mainly for power concerns. As shown in Figure 2, a special control unit sets the power state (i.e., active or drowsy) for each bank. We deliberately did not consider a shutdown state, since that configuration would complicate the design of the scratch-pad memory. The insertion of extra sleep transistors in the memory cells would in fact impact not only the complexity but also the performance of such a critical component.
The control unit is memory mapped and can be programmed with regular write operations. In order to simplify the programming process, we implemented some interfacing functions (described in Section II-C). We estimate that reactivating a bank from the Drowsy state incurs a realistic performance penalty of 0.2 μs. This overhead is in line with that of other commercial embedded processors (e.g., [19], [20]).
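As an illustration of what programming such a memory-mapped control unit with regular writes could look like, the following is a minimal sketch, not the actual interface. It assumes a hypothetical register layout with one bit per bank (1 = active, 0 = drowsy), a hypothetical bank count, and hypothetical function names. The register is simulated by an ordinary variable so that the sketch runs on a host machine; on the target, the pointer would refer to the control unit's address, which is not specified here.

```c
#include <assert.h>
#include <stdint.h>

#define SPM_NUM_BANKS 4u   /* assumption: the bank count is not given in the text */

typedef enum { SPM_DROWSY = 0, SPM_ACTIVE = 1 } spm_power_state;

/* Host-side stand-in for the memory-mapped control register.
 * Hypothetical layout: bit i holds the power state of bank i. */
static volatile uint32_t spm_ctrl_sim = 0;
static volatile uint32_t *const spm_ctrl = &spm_ctrl_sim;

/* Set the power state of one bank with a regular read-modify-write. */
static void spm_set_bank_state(unsigned bank, spm_power_state s)
{
    assert(bank < SPM_NUM_BANKS);
    uint32_t v = *spm_ctrl;
    if (s == SPM_ACTIVE)
        v |= (uint32_t)1u << bank;
    else
        v &= ~((uint32_t)1u << bank);
    *spm_ctrl = v;
}

/* Read back the power state of one bank. */
static spm_power_state spm_get_bank_state(unsigned bank)
{
    assert(bank < SPM_NUM_BANKS);
    return ((*spm_ctrl >> bank) & 1u) ? SPM_ACTIVE : SPM_DROWSY;
}
```

Under these assumptions, waking bank 2 is a single call, spm_set_bank_state(2, SPM_ACTIVE); the 0.2 μs reactivation penalty mentioned above would be paid by the hardware after the write.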
B. Extension to Memory Allocation
Our study is based on the assumption that a significant portion of embedded systems do not have a priori knowledge of the running tasks (e.g., downloading new applications on a smartphone). It may also be that the memory required by a task is unpredictable at compile time (e.g., I/O-triggered memory allocation). For such systems, static (i.e., compile-time) memory allocation strategies cannot be applied. Several dynamic approaches have been proposed in the past, but none of them deals with software-exposed multi-banked SPM architectures.
From the programmer's point of view, we propose a simple extension to the regular malloc(), as described in Table I. We replace the call to malloc() with the new function SPM malloc(). The SPM malloc() method accepts a new parameter: the address of the 'requestor' (i.e., the pointer to the object to allocate). The address of the requestor is fundamental in case we want to move the data from one bank to another: in that case, the pointers to the allocated objects (i.e., the requestors) in the source bank must also be updated with the new values.
Updating the requestors is done automatically by several library functions that are hidden from the programmer. Table II reports the main library functions. Note that extra metadata is required to manipulate the banks and hence the objects stored in the banks. For this reason, we create two new tables, namely spm obj table and spm bank info, containing the actual state of each object and bank, as shown in Table III. Next, we describe the most important library function: SPM banks manager().
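The role of the extra requestor parameter can be illustrated with a small host-side sketch. Since Table I is not reproduced here, the signature below is an assumption (a malloc-like call that additionally receives the address of the pointer that will hold the result), and the function name, the toy bump allocator, and the table sizes are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SPM_SIZE     1024u   /* assumption: toy SPM size in bytes */
#define SPM_MAX_OBJS 16u     /* assumption: capacity of the requestor table */

static uint8_t spm_mem[SPM_SIZE];             /* host-side stand-in for the SPM */
static size_t  spm_top = 0;                   /* bump-allocation cursor */
static void  **spm_requestors[SPM_MAX_OBJS];  /* one recorded requestor per object */
static size_t  spm_num_objs = 0;

/* Hypothetical signature: like malloc(), plus the address of the program
 * pointer that will hold the result, so the library can rewrite it later
 * if the object is moved to another bank. */
static void *spm_malloc(size_t size, void **requestor)
{
    assert(spm_top <= SPM_SIZE);
    if (requestor == NULL || size == 0 || size > SPM_SIZE ||
        spm_num_objs == SPM_MAX_OBJS)
        return NULL;

    /* Round the size up to a multiple of 8 bytes. */
    size_t aligned = (size + 7u) & ~(size_t)7u;
    if (aligned > SPM_SIZE - spm_top)
        return NULL;

    void *p = &spm_mem[spm_top];
    spm_top += aligned;
    spm_requestors[spm_num_objs++] = requestor;  /* remember who points here */
    *requestor = p;                              /* hand the address to the caller */
    return p;
}
```

With this shape, a call site that used to read buf = malloc(n) would become spm_malloc(n, &buf), which is the only change visible to the programmer in this sketch.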
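To make the role of this metadata concrete, the sketch below shows one possible shape for the two tables and a relocation routine that patches the requestors. Table III is not reproduced here, so every field name, the bank geometry, and the function spm_move_bank() are assumptions introduced for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_BANKS 2u     /* assumption */
#define BANK_SIZE 256u   /* assumption */
#define MAX_OBJS  8u     /* assumption */

/* Hypothetical per-object record (stand-in for an spm obj table entry). */
typedef struct {
    void   **requestor;  /* address of the program pointer to this object */
    size_t   offset;     /* offset of the object inside its bank */
    size_t   size;       /* object size in bytes */
    unsigned bank;       /* bank currently holding the object */
} spm_obj;

/* Hypothetical per-bank record (stand-in for an spm bank info entry). */
typedef struct {
    int    active;       /* 1 = active, 0 = drowsy */
    size_t used;         /* bytes occupied by objects in this bank */
} spm_bank;

static uint8_t  banks_mem[NUM_BANKS][BANK_SIZE];
static spm_bank bank_info[NUM_BANKS];
static spm_obj  obj_table[MAX_OBJS];
static size_t   num_objs = 0;

/* Move every object from bank src to bank dst and patch its requestor.
 * Returns 0 on success, -1 if dst does not have enough free space. */
static int spm_move_bank(unsigned src, unsigned dst)
{
    assert(src < NUM_BANKS && dst < NUM_BANKS && src != dst);
    if (bank_info[src].used > BANK_SIZE - bank_info[dst].used)
        return -1;

    for (size_t i = 0; i < num_objs; i++) {
        spm_obj *o = &obj_table[i];
        if (o->bank != src)
            continue;
        size_t new_off = bank_info[dst].used;
        memcpy(&banks_mem[dst][new_off], &banks_mem[src][o->offset], o->size);
        bank_info[dst].used += o->size;
        o->bank = dst;
        o->offset = new_off;
        *o->requestor = &banks_mem[dst][new_off];  /* update the program's pointer */
    }
    bank_info[src].used = 0;
    bank_info[src].active = 0;   /* source bank can now go drowsy */
    bank_info[dst].active = 1;
    return 0;
}
```

The design point this illustrates is that the program never holds the only copy of an object's address: because the table stores where each pointer lives, data can be relocated without the application noticing.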
C. The SPM banks manager() Procedure
The key feature of this method is to move data automatically from one partition to another whenever the aging indices of two banks differ by more than a given threshold. Figure 3 shows the call precedence graph for managing such a function. At the moment the function is called from within the SPM malloc() method (i.e., every time an object must be allocated), but this is not a requirement. The OS could instead invoke the function at regular intervals (e.g., once every hour).
The algorithm starts with all the banks of the scratchpad memory turned off, so at the beginning one bank must be turned on for the first time. After the first bank is turned on, the algorithm estimates the maximum difference in aging between the turned-on and the turned-off banks. This is done by calculating the value of psleep for every bank i (0 ≤ i < number of banks), where
$$\mathit{psleep}_i = \frac{\mathit{total\_idle\_time}_i}{\mathit{total\_time}}$$
and total idle time i is the sum of all the time intervals during which bank i has been turned off until now, or
$$\mathit{total\_idle\_time}_i = \sum_{k} \mathit{idle\_interval}_{i,k}$$
We expect that the smaller the psleep value of a bank, the greater its aging, since a bank that is turned off for a smaller percentage of the overall operation is exposed to more aging effects. In other words, the psleep value of a particular bank is the ratio of the time that the bank is off to the total time of operation. We then compute the difference between the psleep value of the off bank with the maximum psleep value (i.e., the least aging) and the psleep value of the on bank with the minimum psleep value (i.e., the greatest aging). Mathematically, this can be expressed as:
$$\Delta\mathit{psleep} = \mathit{psleep}_{max} - \mathit{psleep}_{min}$$
where:
$$\mathit{psleep}_{max} = \max_{i \,\in\, \text{off banks}} \mathit{psleep}_i, \qquad \mathit{psleep}_{min} = \min_{i \,\in\, \text{on banks}} \mathit{psleep}_i$$
Moreover, we define Delta as the relative difference in percent, which is the form consistent with the threshold example given in Section III:
$$\mathit{Delta} = 100 \times \frac{\Delta\mathit{psleep}}{\mathit{psleep}_{min}}$$
If Delta is greater than a given threshold, then these two banks with min/max psleep values must switch data and status. That is, the selected turned off bank will be turned on and will accommodate the data written in the selected turned on bank and this turned on bank will now be turned off. We call this threshold Delta threshold. If the difference between psleep values is negative, then no modification will be made on the data that are allocated in the banks, or the status of the banks. Figure 4 illustrates the decision flow of the algorithm.
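The decision step described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the library's code: the bank count, the type and function names, and the treatment of a zero minimum psleep (taken here as always exceeding the threshold) are hypothetical, and Delta is assumed to be the relative psleep difference in percent, in line with the threshold example of Section III.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_BANKS 4u   /* assumption: the bank count is not given in the text */

typedef struct {
    int    on;               /* 1 = turned on (active), 0 = turned off (idle) */
    double total_idle_time;  /* accumulated idle time of this bank so far */
} bank_t;

/* psleep_i: fraction of the elapsed time that bank i has spent idle. */
static double psleep(const bank_t *b, double total_time)
{
    return total_time > 0.0 ? b->total_idle_time / total_time : 0.0;
}

/* Returns 1 if a swap is due, 0 otherwise. When 1 is returned, *on_bank is
 * the on bank with the minimum psleep and *off_bank is the off bank with the
 * maximum psleep; the outputs are not meaningful when 0 is returned. */
static int spm_swap_due(const bank_t banks[NUM_BANKS], double total_time,
                        double delta_threshold,
                        size_t *on_bank, size_t *off_bank)
{
    assert(banks != NULL && on_bank != NULL && off_bank != NULL);
    int have_on = 0, have_off = 0;
    double ps_min = 0.0, ps_max = 0.0;

    for (size_t i = 0; i < NUM_BANKS; i++) {
        double ps = psleep(&banks[i], total_time);
        if (banks[i].on) {
            if (!have_on || ps < ps_min) { ps_min = ps; *on_bank = i; have_on = 1; }
        } else {
            if (!have_off || ps > ps_max) { ps_max = ps; *off_bank = i; have_off = 1; }
        }
    }
    if (!have_on || !have_off)
        return 0;                      /* nothing to swap with */

    double dps = ps_max - ps_min;      /* the psleep difference */
    if (dps <= 0.0)
        return 0;                      /* non-positive difference: no change */
    if (ps_min == 0.0)
        return 1;                      /* assumption: relative gap is unbounded */
    return 100.0 * dps / ps_min > delta_threshold;
}
```

When the function reports that a swap is due, the caller would copy the data of the selected on bank into the selected off bank and exchange their power states, as the text describes.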
At the end of the algorithm we expect that the psleep values will be more evenly distributed among the banks and that the worst psleep value (i.e., the smallest one) will have increased, indicating a smaller worst case bank aging than before.
III. PRELIMINARY RESULTS
Due to time limitations we will not be able to give extensive results here. Instead, we will have to rely on a motivational example that shows how we can influence the shift in signal noise margin with our software managed NBTI-aware memory allocation policy.
As explained in Section II-B, we are targeting systems that do not have compile-time knowledge of the running tasks. To mimic this system behavior, we implemented a microbenchmark that executes a parameterizable number of tasks, each of them instantiating a certain number of objects. The tasks are chosen from a set of basic matrix manipulation functions (i.e., multiplication, addition, and LU decomposition). Each simulation accepts as input the percentage of SPM utilization (20%, 40%, 60%, 80%). We experimented with various values of the Delta threshold. The Delta threshold should not be very small, since in that case swaps between banks would occur very often, thus affecting the overall performance. Still, the Delta threshold should be small enough that swaps will eventually occur. We concluded that the aging improvement is maximized for values of the Delta threshold around 30. This means that a swap should occur if psleep max is at least equal to 1.3 × psleep min. The results we present are for Delta threshold = 30.
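The link between a threshold of 30 and the factor 1.3 can be made explicit with one worked step. This is a sketch that assumes Delta is the relative difference between the two psleep values expressed in percent, which is the reading consistent with the numbers above; the value 0.40 in the comment is a made-up illustration, not a measurement.

```latex
\mathit{Delta}
  = 100 \times \frac{\mathit{psleep}_{max} - \mathit{psleep}_{min}}{\mathit{psleep}_{min}} \;\ge\; 30
\quad\Longleftrightarrow\quad
\mathit{psleep}_{max} \;\ge\; 1.3 \times \mathit{psleep}_{min}
% Illustrative only: with psleep_min = 0.40, a swap is triggered once psleep_max >= 0.52.
```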
We investigated three different bank management policies:
vanilla: The SPM contains a single bank that is always turned on.
power-aware: The banks can be set in the Drowsy state to save power. This policy is unaware of any aging effect.
NBTI-aware: The aging status of the banks is monitored by the SPM banks manager() function. The banks are set in the Drowsy or Active state according to the SNM degradation.
For each simulation, we collect the worst of the SNM degradation values, since the worst bank (i.e., the one with the highest aging) defines the aging of the entire SPM. Figure 5 shows the SNM degradation after three years for different percentages of SPM utilization. As expected, the proposed NBTI-aware policy significantly reduces the SNM degradation, especially for lower utilization values. From Figure 5, for 20% utilization the SNM degradation shows an improvement of about 32%, while for 80% utilization there is a lower improvement of about 13%. Note that, from an aging perspective, the power-aware and the vanilla policies are equivalent, since the minimum value of the idleness (which determines the aging) is 0 in both cases. The proposed NBTI-aware technique, instead, tries to equalize the different aging indices by triggering changes in the power states of the banks. Figure 6 shows the minimum psleep value, i.e., the percentage of idle time for the bank that has been active for the longest time.
As expected, as we increase the percentage of utilization of the SPM, the benefits of our proposed solution diminish. The chances of finding free space to allocate, and to move data from one bank (which will then be set in the Drowsy state) to another, are in fact reduced. This leads us to the conclusion that our technique gives better aging improvement for lower values of SPM utilization.
Performance is the main parameter that we expect to be affected by the NBTI-aware technique, since the applied algorithm, and specifically the SPM banks manager() procedure, needs some extra time to compute the psleep values for each bank, sort them, compute the Δpsleep and if necessary swap the data between the banks and update the banks status.
In order to estimate the performance overhead, it is essential to define how often the SPM banks manager() procedure is invoked; we refer to the interval between two invocations as the time window. The performance overhead for several time windows is shown in Figure 7. We observe that the performance overhead lies between 0.29 Mc and 3.55 Mc, which corresponds to 0.7% and 8.8% of the total execution time, respectively. As the time window increases, the performance overhead decreases, since the SPM banks manager() procedure is not called as often. At the same time, the system does not obtain the maximum possible aging benefit from the NBTI-aware technique, and thus the SNM also decreases. This is illustrated by the red line in Figure 7, which plots the SNM Degradation for several time windows. For this reason it is important to choose the time window that results in a good combination of aging benefits and performance overhead, i.e., the time window that minimizes |SNM Degradation × Overhead|. As illustrated in Figure 8, the |SNM Degradation × Overhead| product is minimized for a time window of 16 Mc. In this case the SNM Degradation is −12.16 and the performance overhead is only 0.29 Mc, which corresponds to 0.7% of the total execution time. So, with careful selection of the simulation parameters, we can achieve a considerable aging improvement without significantly affecting the performance of the system. This is essential for embedded systems, since both performance and reliability play a determinant role.
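The selection criterion above amounts to an argmin over the measured time windows. The following is a minimal sketch of that selection, assuming the measurements are available as an array of (time window, SNM Degradation, overhead) samples; the struct and function names are hypothetical, and no measured data is embedded in the code.

```c
#include <assert.h>
#include <stddef.h>

/* One measurement point: a time window and the values observed for it. */
typedef struct {
    double window_mc;        /* time window between manager invocations, in Mc */
    double snm_degradation;  /* SNM Degradation observed for this window */
    double overhead_mc;      /* performance overhead for this window, in Mc */
} window_sample;

/* |SNM Degradation x Overhead| for one sample. */
static double window_cost(const window_sample *s)
{
    double c = s->snm_degradation * s->overhead_mc;
    return c < 0.0 ? -c : c;
}

/* Return the index of the sample that minimizes |SNM Degradation x Overhead|.
 * Ties are resolved in favor of the earliest sample. */
static size_t best_window(const window_sample *samples, size_t n)
{
    assert(samples != NULL && n > 0);
    size_t best = 0;
    double best_cost = window_cost(&samples[0]);
    for (size_t i = 1; i < n; i++) {
        double cost = window_cost(&samples[i]);
        if (cost < best_cost) {
            best = i;
            best_cost = cost;
        }
    }
    return best;
}
```

Using a product as the cost treats a relative change in either quantity equally; a weighted sum would be an alternative if one metric should dominate, but that choice is outside what the text specifies.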
IV. CONCLUSIONS AND FUTURE WORK
In this paper we have introduced a new method for managing memory allocation in scratchpad sub-banks such that the effects of NBTI aging are reduced. Since the SPM is managed by software in terms of address mapping, it is appropriate that the method for NBTI-aware data allocation is also managed by software. Using a simple extension to the regular malloc(), we show through a motivational example that the idle times of the banks in the SPM are more evenly distributed than with an NBTI-oblivious memory allocation scheme. This even distribution prevents some banks from aging faster than others, thereby increasing the reliability of the memory system overall. The SNM degradation can be improved by up to 32% when applying the proposed technique. The performance penalty lies between 0.29 and 3.55 Mc, which corresponds to 0.7% to 8.8% of the total execution time. Our proposed technique thus mitigates the aging effects while meeting the performance needs of embedded systems. For future work, we plan to experiment with a broader range of benchmarks and to determine how utilization, number of banks, and different psleep thresholds affect the overall aging of the SPM in terms of signal noise margin.
