Voltage scaling to values near the threshold voltage is a promising technique to hold off the many-core power wall. However, as voltage decreases, some SRAM cells are unable to operate reliably and show a behavior consistent with a hard fault. Block disabling is a micro-architectural technique that allows low-voltage operation by deactivating faulty cache entries, at the expense of reducing the effective cache capacity. In the case of the last-level cache, this capacity reduction leads to an increase in off-chip memory accesses, diminishing the overall energy benefit of reducing the supply voltage. In this work, we exploit the reuse locality and the intrinsic redundancy of multi-level inclusive hierarchies to enhance the performance of block disabling with negligible cost. The proposed fault-aware last-level cache management policy maps critical blocks, those not present in private caches and with a higher probability of being reused, to active cache entries. Our evaluation shows that this fault-aware management results in up to 37.3% and 54.2% fewer misses per kilo instruction (MPKI) than block disabling for multiprogrammed and parallel workloads, respectively. This translates to performance enhancements of up to 13% and 34.6% for multiprogrammed and parallel workloads, respectively.
Introduction
For recent CMOS technologies, power density is the main performance limiting factor across most computing segments. Moore's law continues to hold, with a doubling of the number of transistors and integration density in each new process generation, but Dennard scaling no longer applies, and we are not able to keep a constant power density across technology generations. Power budgets prevent us from utilizing all the available transistors, leading to dark silicon [1].
For years, industry has relied on scaling the supply voltage (V_dd) to reduce power consumption, but this trend has dramatically slowed since the 90 nm generation because of leakage. Reducing operating voltages to values near the threshold voltage (V_th) would minimize leakage and switching power consumption.
The resulting power reduction could be used to activate more chip resources and potentially achieve performance improvements [2] .
Unfortunately, V_dd scaling is limited by the tight margins of the on-chip cache SRAM transistors. Excessive parameter variations in SRAM cells limit the
In contrast, last-level caches (LLCs) are usually shared and have larger sizes and associativity, accounting for much of the die area [6]. Hence, for LLCs, minimum-geometry 6T cells are preferred to achieve higher densities.
At the architectural level, fault-tolerant cache designs rely on disabling faulty resources at different granularities [7], or correcting defective bits through either error correction codes (ECCs) [8] or a distributed duplication of blocks [9, 10].
Block Disabling (BD) is a simple technique that disables a cache entry when a defective bit is found [11]. It is already implemented in modern processors to protect against hard faults [6]. However, due to the random distribution of defective cells, the capacity of the cache is rapidly compromised. Complex techniques based on ECCs or the combination of faulty resources are able to rescue more cache capacity, but incur large storage overheads and sometimes require complex remapping that penalizes the cache access latency.
In our work, we have developed a new approach to mitigate the impact of SRAM failures in LLCs due to parameter variations, based on BD but also relying on the underlying structures already present in CMPs. We identify a natural source of on-chip data redundancy that arises because of the replication of blocks in inclusive multi-level cache hierarchies and exploit this redundancy through a smart fault-aware cache management policy.
In this paper, we make the following contributions. First, we provide an evaluation of BD techniques in a shared-memory coherent CMP running parallel and multiprogrammed workloads with a complete and detailed memory model, exploring SRAM cells with different probabilities of failure. Second, we introduce a technique that keeps the tags of the LLC and, therefore, the tracking capabilities of the coherence directory operational. This way, a block not physically stored in the LLC can reside in the private level and be made available to other cores. As an alternative to main memory supply, we set up a cache-to-cache copy service to support code or data sharing (thread migration, operating system, or parallel workloads). Finally, we propose a fault-aware cache management policy that predicts the usefulness of a block based on its use pattern, and guides the allocation of blocks to faulty and non-faulty cache entries, adding no overhead to the original replacement policy.
Our fault-aware cache management policy is able to decrease the LLC misses per kilo instruction (MPKI) by up to 37.3% with respect to BD, which translates to speedup improvements of 2% to 13% for multiprogrammed workloads. For parallel workloads, the MPKI values decrease by 5% to 54.2% with respect to BD for the different SRAM cells considered, improving performance by up to 34.6%. This paper extends our previous work [12] in several significant ways: i) a new fault-aware cache management policy aimed at caches operating at low voltages, ii) a detailed implementation of the block disabling with operational tags (BDOT) technique, iii) a more realistic SRAM fault model, improving the accuracy of the results, and iv) a more detailed evaluation including multiprogrammed workloads and cache capacity/energy analysis.
The rest of the paper is organized as follows. Section 2 introduces the problem of process variations and its effect on SRAM cell reliability. Section 3 comments on BD and its impact on large cache structures. In Section 4, we describe how to take advantage of the coherence infrastructure to operate at low V_dd. Section 5 introduces a fault-aware cache management policy for LLCs operating at low voltages. Section 6 describes the methodology. Section 7 presents our evaluation. Section 8 discusses the system impact. In Section 9, we comment on related work, and in Section 10, we outline our conclusions.
Process Variations in SRAM cells
SRAM structures are especially vulnerable to failures due to process variations, as they are aggressively sized to meet high density requirements, and because of the vast number of cells that comprise on-chip SRAM structures [13]. In particular, intra-die random dopant fluctuations (RDFs) are the main cause of threshold voltage variation [14]. The stochastic nature of the ion implantation process leads to a distribution of V_th values across a chip, which reduces the already tight transistor margins. Hence, SRAM structures have a minimum voltage, V_ddmin, to guarantee reliable operation, which is typically of the order of 0.7-1.0 V in current process generations, when 6T cells are used.
The robustness of SRAM cells under the V_ddmin range has been extensively analyzed in the literature [10, 9, 3, 8, 15]. Zhou et al. studied six different sizes of 6T SRAM cells in 32 nm technology, and their probabilities of failure as V_dd decreases [4]. According to that study, at 0.5 V, the probability of failure of an SRAM cell (P_fail) is between 10^-3 and 10^-2. The use of larger cells reduces the probability of failure, as non-uniformities average out, increasing read and write margins and resulting in more robust devices. However, large cells reduce the density and increase power and energy consumption.
As Table 1 shows, less than 10% of the cache entries are non-faulty for the small cells C1 and C2 at 0.5 V. If the cache is implemented with the more robust C6 cells, however, the percentage of non-faulty cache entries rises to 60%, but at the cost of a 58% increase in area (relative to C1), and the consequent increase in leakage, which is not a suitable option for a large structure such as an on-chip LLC.
In this work, we take Zhou's reliability study as a reference to test our proposals on a wide range of failure probabilities. We will only consider C2 to C6 operating at 0.5 V (our target near-threshold V_dd), as at this voltage, a cache built with C1 cells would have all its capacity compromised.
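The link between the per-cell failure probability and the fraction of faulty cache entries can be illustrated with a short calculation. The sketch below assumes that cells fail independently and that an entry is faulty as soon as one of its data cells fails; the 64-byte block size (`BITS_PER_ENTRY`) is an assumption made for illustration and is not taken from the text above.

```python
def entry_fault_probability(p_cell: float, bits_per_entry: int) -> float:
    """Probability that an entry holds at least one faulty cell,
    assuming independent cell failures (an illustrative assumption)."""
    if not 0.0 <= p_cell <= 1.0:
        raise ValueError("p_cell must be a probability")
    if bits_per_entry < 0:
        raise ValueError("bits_per_entry must be non-negative")
    # An entry works only if every one of its cells works.
    return 1.0 - (1.0 - p_cell) ** bits_per_entry


# Assumption: 64-byte blocks, counting data bits only.
BITS_PER_ENTRY = 64 * 8

for p_cell in (1e-3, 1e-2):
    p_entry = entry_fault_probability(p_cell, BITS_PER_ENTRY)
    print(f"P_fail = {p_cell:g} -> faulty entries: {p_entry:.1%}")
```

Under these assumptions, the lower end of the P_fail range quoted above (10^-3) already leaves roughly 40% of the entries faulty, which shows why even small per-cell failure rates compromise a large share of the LLC capacity.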
Impact of Block Disabling on Large Shared Caches at Ultra-low Voltages
A simple approach to handling hard faults is the disabling of faulty elements.
BD deactivates resources at block (cache entry; see footnote 2) granularity: when a fault is detected at a given cache entry, that entry is marked as defective and it can no longer store a cache block [11]. This technique is implemented in modern processors to enable them to tolerate hard faults [6].
BD has also been studied for operation at low voltages because of its easy implementation and low overhead [15]. From the implementation perspective, only one bit per entry suffices to mark the entry as faulty. The main drawback of this approach is that the amount of capacity dramatically falls when the probability of failure increases, as shown in Table 1.
Footnote 2: In this work, we differentiate between cache block and cache entry: block refers to the transfer unit, the content per se, while entry refers to the physical group of cells that store a block.
Inclusive hierarchies perform poorly when the aggregated size of the private caches is similar to the size of the LLC [17], and BD exacerbates the problem because of the substantial associativity and capacity degradation in the LLC.
where n is the associativity, and p denotes the probability of failure of a cache entry. Figure 1 shows how the associativity degrades as more faulty cells appear on the cache structure.
observation is the basis for the techniques we propose in this paper.
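The associativity degradation discussed above can be sketched with a binomial model. This is an illustrative assumption (entries failing independently with probability p in an n-way set), not necessarily the exact expression used by the authors; the function names are hypothetical.

```python
from math import comb


def working_ways_distribution(n: int, p: float) -> list:
    """P(exactly k non-faulty ways) for k = 0..n in an n-way set,
    assuming each entry fails independently with probability p."""
    return [comb(n, k) * (1.0 - p) ** k * p ** (n - k) for k in range(n + 1)]


def expected_associativity(n: int, p: float) -> float:
    """Mean number of non-faulty ways per set under the same assumption."""
    return n * (1.0 - p)


def prob_no_working_way(n: int, p: float) -> float:
    """Probability that every way of a set is faulty."""
    return p ** n


# Example with hypothetical values: a 16-way set, 40% faulty entries.
dist = working_ways_distribution(16, 0.4)
print(f"expected working ways: {expected_associativity(16, 0.4):.1f}")
print(f"sets with no working way: {prob_no_working_way(16, 0.4):.2e}")
```

The first element of the distribution is the fraction of sets left with no operative way, which is the situation that plain block disabling cannot handle.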
Our proposal has been designed for inclusive memory hierarchies, but most of the proposed ideas could benefit non-inclusive and exclusive hierarchies as well.
The objectives of our replacement and promotion algorithms are to assign the non-faulty entries to blocks with reuse and to blocks that are not present in the private caches. On the one hand, these objectives are still valid in a non-inclusive hierarchy; however, their relative importance is different, and our algorithms should consider different priorities for allocation and promotion decisions. On the other hand, our proposal alleviates a specific problem of inclusive hierarchies, such as the need to invalidate a block in a private cache when it is evicted from the shared cache. This problem does not exist in non-inclusive hierarchies, and therefore our proposal is not applicable in this specific aspect.
Note in Figure 1 that, when using cell types C3 and C2, 0.6% and 18.9% of the sets have no operative ways, respectively. To be able to offer a complete comparison with BD, we assume that at least one of the ways in each set is non-faulty, although this is not a requirement for the techniques we present in this paper, and the LLC is able to operate even when all the ways of a set are faulty.
for our configuration, see Table 2 in Section 6), and increasing the cell size by 33% (assuming 8T SRAM cells [18]) will only increase the total area of a cache bank by 2%. Since using sophisticated ECCs could increase the access latency of the tag array, while using resilient tag cells involves little overhead, we opt for the latter. This approach is also consistent with prior work [9, 10]. Moreover, many of today's CPUs use different cell types for tag and data arrays [19].
Exploiting Inclusive Hierarchies to
Contrary to other proposals, our mechanism works even when all entries of a set are faulty. The LLC saves the tags for both faulty and non-faulty entries, maintaining the coherence status of all the blocks, and allowing blocks to be stored in the private levels without the need of a data replica in the shared level. Hence, it is possible to store a block in the private caches even if all the data ways of the corresponding LLC set are faulty.
BDOT Limitations

BDOT, as described above, has two potential limitations, both related to the allocation of blocks to faulty entries.
First, BDOT always forwards requests to blocks allocated to faulty entries to the off-chip memory. However, a block allocated to a faulty entry might be present on-chip, if it is being used by a private cache (L1). This situation is common in parallel workloads, which share data and instructions. In this case, the directory information can be used to orchestrate cooperation among L1 caches. When the directory protocol receives an L1 request to a shared block mapped to a T entry, it forwards the request to one of the sharers of the block, namely, the L1 cache closest to the requester in terms of Manhattan distance. That L1 will serve the block through a cache-to-cache transfer.
Cache-to-cache transfers are already implemented in the baseline coherence protocol for exclusively owned blocks. Hence, no additional hardware is required and a slight modification of the directory protocol suffices to trigger a shared block transfer. So from now on, we assume that BDOT includes this feature.
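The request handling described above can be sketched as follows. This is a simplified model, not the authors' protocol implementation: `DirectoryEntry`, `route_l1_miss`, the (x, y) tile coordinates, and the tie-breaking rule among equidistant sharers are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryEntry:
    has_data: bool  # True for a D entry, False for a T (tag-only) entry
    sharers: set = field(default_factory=set)  # (x, y) tiles whose L1 holds the block


def manhattan(a, b) -> int:
    """Manhattan distance between two tiles of the mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def route_l1_miss(entry, requester):
    """Decide who supplies a block after an L1 miss.

    Returns ("llc", None), ("l1", tile) or ("memory", None).
    """
    if entry is None:        # LLC tag miss: the block is not on-chip
        return ("memory", None)
    if entry.has_data:       # D entry: the LLC bank supplies the data
        return ("llc", None)
    others = entry.sharers - {requester}
    if others:               # T entry with sharers: cache-to-cache transfer
        # Assumption: ties are broken by tile coordinates.
        nearest = min(others, key=lambda t: (manhattan(t, requester), t))
        return ("l1", nearest)
    return ("memory", None)  # T entry without sharers: go off-chip


# A block tracked by a T entry and shared by two L1s is served by the closer one.
shared = DirectoryEntry(has_data=False, sharers={(0, 0), (3, 3)})
print(route_l1_miss(shared, requester=(2, 2)))
```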
The second limitation comes from allocating blocks to LLC entries without taking into account their T or D nature. Unfortunately, this blind allocation can result in heavily reused blocks being attached to faulty entries. Indeed, if a particular block of the LLC is required repeatedly from an L1 cache (i.e., the block shows reuse), any replacement algorithm will tend to protect it, reducing its eviction chances. Thus, if a block with reuse is initially allocated to a T entry, unless replicated in other cores, all L1 cache misses to that block will be forwarded off-chip by the LLC.
In the next section, we introduce a specific allocation and reallocation policy for BDOT caches that differentiates between T and D entries.
Fault-aware Cache Management Policy for BDOT Caches
Conventional cache management policies assume that every cache entry can store a block, while BDOT breaks this assumption: each set in an N-way set associative cache contains T entries that store only tags, and D entries that store tags and data. Keeping in mind the main goal of improving the overall LLC performance under BDOT, this section introduces a fault-aware cache management policy that takes into account the distinct nature of T and D entries, and the reuse pattern of the reference stream. In particular, we seek to achieve the following two goals:
1. To allocate blocks that are most likely to be used in the future to D entries. Prior work has shown that reuse is a very effective predictor of the usefulness of a given block in the LLC [20, 21]. Reuse locality can be described as follows:
These goals may be added to any management policy. In this work, we will build on top of a state-of-the-art reuse-based replacement algorithm: Not-Recently Reused (NRR) [20]. Next, we describe the baseline replacement in some depth and then we add awareness of the existence of faulty entries.
Baseline NRR Replacement Algorithm
The NRR algorithm requires four states per LLC block, as depicted in 
Reuse-based and Fault-aware Management for BDOT Caches
Seeking to guarantee that valuable blocks remain in the LLC, we devise a fault-aware management policy by distinguishing between T and D entries. One option is to promote blocks by reallocating them from T to D entries, if needed, to improve the overall cache performance. The design choices include where the promoted data comes from and which victim is chosen as a target of the consequent demotion. At the same time, we want to continue exploiting reuse in the simple and efficient way offered by an NRR-like replacement algorithm, which is unaware of faulty entries. Thus, our goal is to design a comprehensive cache management policy, merging reuse exploitation and faulty entry management.
Below, we elaborate on the two mechanisms that are key to achieving this, namely block insertion/replacement and block promotion/demotion.
Insertion and Replacement of Blocks
On a first insertion (LLC miss), an incoming block has not shown reuse, and hence allocating it to a T entry seems a reasonable idea. Figure 3a shows an example of a cache block to be inserted in a 4-way cache set with two T entries (those storing q and r tags) and two D entries (those storing p and s tags and the corresponding P and S data). A victim is selected among the blocks allocated to T entries. The baseline replacement policy dictates which of those blocks (Q and R) is selected for replacement. This is equivalent to predicting that the incoming block X is not going to be reused. If the reuse pattern of the block is mispredicted, block X should be reallocated to a D entry, to reduce its access time and transfer energy in future L1 misses. This reallocation will be performed using the promotion mechanism we detail in the next subsection.
Dealing with first insertions this way is very simple but has a clear disad-
problem is not easy. We explored various adaptive mechanisms in which some D entries are used as T. However, it is difficult to determine the optimal number of T entries, this being highly dependent on the workload. Several experiments (data not shown in Section 7, for the sake of brevity) showed that the performance returns were disappointing given the required complexity.
Given that our promotion mechanism reallocates reused blocks to D entries and non-reused ones to T entries (as we will see in the following subsection), we realized that the baseline NRR replacement policy itself suffices to achieve our initial goals because it protects reused blocks. Since NRR gives lower priority to non-reused blocks, blocks allocated to T entries will have more chances to be evicted. This implies that, with a balanced distribution between T and D entries, an incoming block will have a higher probability of being inserted in a T entry than in a D entry. If the number of T entries in a set is very low, and even if there are no T entries in a set, the mechanism still works correctly.
NRR periodically resets the reuse bit of those blocks not present in private caches, so some D entries become replacement candidates with the same priority as T entries. Hence, the initial insertion does not necessarily have to consider the nature of the entry, and our implementation relies only on the baseline replacement policy to select the victim block.
Promotion and Demotion of Blocks
A blind allocation of blocks to cache entries may result in valuable blocks (i.e., those with reuse) being initially allocated to T entries, and vice versa. However, this undesirable situation can be tracked on the fly through the reuse footprint, and reversed by swapping a T entry with a D entry: when a block allocated to a T entry shows reuse, we will promote it to a D entry. Promotion involves a complementary demotion of the block stored in the selected D entry.
To select which block is demoted, we also rely on reuse and L1 presence information. Reused blocks should be kept in the LLC, but unlike in the baseline replacement policy, block demotion does not involve an LLC tag eviction. Furthermore, if the block is present in L1, losing the contents of the LLC is not critical, because there is at least one on-chip copy of the block, which can be supplied by a cache-to-cache transaction. Thus, to maximize the amount of on-chip data, the demotion algorithm will select the victim block among those present in L1. Among the blocks in L1, non-reused ones should have more chances of being demoted.
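The demotion preferences just described can be sketched as a simple priority order. The code below is an illustration under assumptions: the text only states that non-reused blocks "should have more chances" of being demoted, whereas this sketch uses a strict deterministic priority, and the fallback to blocks that are not present in L1 is a hypothetical choice. `Block` and `select_demotion_victim` are illustrative names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    tag: str
    has_data: bool  # True if the block is stored in a D entry
    in_l1: bool     # True if at least one private cache holds a copy
    reused: bool    # True if the block has shown reuse


def select_demotion_victim(blocks: List[Block]) -> Optional[Block]:
    """Pick the D-entry block to demote to a T entry.

    Priority (lower tuple sorts first): blocks present in L1 before absent
    ones, and, within each group, non-reused blocks before reused ones.
    """
    candidates = [b for b in blocks if b.has_data]
    if not candidates:
        return None  # no D entry in this set: nothing can be demoted
    return min(candidates, key=lambda b: (not b.in_l1, b.reused))


cache_set = [
    Block("a", has_data=True, in_l1=True, reused=True),
    Block("b", has_data=True, in_l1=True, reused=False),
    Block("c", has_data=False, in_l1=True, reused=False),  # T entry, not a candidate
]
print(select_demotion_victim(cache_set).tag)
```

Demoting a block that still has an L1 copy keeps that data on-chip, so the swap adds data to the hierarchy instead of merely exchanging one block for another.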
Note that the promotion of a block can be performed at two different times: at reuse detection (i.e., on a second L1 request to a block stored in a T entry) or after the second eviction from L1 (i.e., on eviction after reuse). Performing the promotion after the second request from L1 duplicates the content, as a copy of the block is also stored in a private cache, whilst performing the promotion after the L1 eviction meets the goal of maximizing the amount of on-chip data. Therefore, we opt for the latter and trigger promotions only after L1 evictions, which requires L1 evictions to be non-silent and to carry the block data.
The promotion/demotion process is illustrated in Figure 3b. When block R, which is stored in a T entry, is evicted from the L1 cache and selected for promotion (i.e., its reuse bit is set), we select a victim among the demotion candidates (P and S in Figure 3b). Once the victim is selected (P in our example), we swap the cache contents in three steps: (1) discard the data entry P, writing back the data to memory, if dirty; (2) swap p and r tags; and (3) copy the data (R) to the available D entry, which was occupied by the demoted block (P).
Figure 4) does not take into account the nature of the entry, and it solely depends on the victim selection arising from reuse and L1 presence; i.e., it only depends on the baseline replacement algorithm. After insertion, blocks will move along NR-C, NR-NC, R-C, and R-NC superstates as they would do in a cache without considering faulty entries.
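The three-step swap can be sketched in a few lines. This is a simplified software model, not the hardware mechanism: `Entry`, `promote`, and the `write_back` callback are hypothetical names, and coherence state handling is left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Entry:
    faulty: bool                  # physical property: True means T entry (tag only)
    tag: Optional[str] = None
    data: Optional[bytes] = None  # always None in a T entry
    dirty: bool = False


def promote(cache_set: List[Entry], t_idx: int, d_idx: int,
            evicted_data: bytes, evicted_dirty: bool,
            write_back: Callable[[str, bytes], None]) -> None:
    """Swap the block tracked by T entry t_idx with the block in D entry d_idx."""
    t_entry, d_entry = cache_set[t_idx], cache_set[d_idx]
    if not t_entry.faulty or d_entry.faulty:
        raise ValueError("promotion needs a T entry as source and a D entry as target")
    # Step 1: discard the demoted block's data, writing it back if dirty.
    if d_entry.dirty:
        write_back(d_entry.tag, d_entry.data)
    # Step 2: swap the two tags.
    t_entry.tag, d_entry.tag = d_entry.tag, t_entry.tag
    # Step 3: copy the data just evicted from L1 into the freed D entry.
    d_entry.data, d_entry.dirty = evicted_data, evicted_dirty
    t_entry.data, t_entry.dirty = None, False


# Mirrors the example in the text: R (tag r, T entry) replaces P (tag p, D entry).
written = []
cache_set = [Entry(faulty=False, tag="p", data=b"P", dirty=True),
             Entry(faulty=True, tag="r")]
promote(cache_set, t_idx=1, d_idx=0, evicted_data=b"R", evicted_dirty=False,
        write_back=lambda tag, data: written.append((tag, data)))
print(cache_set[0].tag, cache_set[1].tag, written)
```

Note that the tags move between physical entries while the `faulty` flag stays with the entry, which is why no LLC tag eviction is needed.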
Summary and Implementation
To guarantee that high-value blocks (those showing reuse) remain in the LLC, the policy reallocates them from T to D entries when they are evicted from the L1 and reside in a faulty LLC entry: R-C-T state. After L1 eviction, blocks in R-C-T trigger a promotion, which results in the transition to an R-NC-D state and reallocation to a D entry, with the consequent demotion of another block within the set to a T entry. A block being demoted can be in any of the superstates, and according to the victim selection algorithm, we first demote blocks that are present in the private levels, in order to maximize the amount of data available in the on-chip hierarchy. As a secondary objective, the policy
For example, instead of relying on the reuse information of the blocks, a future use predictor [22] could be utilized to decide which blocks should be allocated to D entries, or a dead block predictor [23] could be used to indicate which blocks may be demoted to T entries, but these solutions add complexity to the cache logic as well as requiring more storage overhead.
Methodology
Overview of the System
Our baseline system consists of a tiled CMP with an inclusive two-level cache hierarchy, where the second-level cache, or LLC, is shared and distributed among the processor cores. Tiles are interconnected by means of a mesh. Each tile has a processor core with a private first-level cache (L1) split into instructions and data, and a bank of the shared LLC, both connected to the router (Figure 5). As in most CMPs, the write policy for the L1 data caches is write-back, because other policies, such as write-through, may collapse the interconnection network [24]: the mesh would have to convey every single store from the cores to the LLC banks to guarantee content inclusion. The CMP includes two memory controllers located at the edges of the chip. Table 2 shows the parameters of the baseline processor, memory hierarchy, and interconnection network. As in previous studies [9, 10], we assume that the LLC tag arrays are hardened by using upsized cells such as 8T [18]. The baseline LLC replacement policy is Not-Recently Used (NRU) [25] extended with private copy protection [17]. We implement this protection by using coherence directory information updated by non-silent L1 block evictions.
Experimental Set-up
Regarding our experimental set-up, we model the CMP system described in Table 2. We use Simics [26] in combination with GEMS [27] to simulate the on-chip memory hierarchy and interconnection network, and DRAMSim2 [28] to simulate the DDR3 DRAM in detail. To obtain timing, area, and energy consumption, we use the McPAT framework [29] for the on-chip components, and DRAMSim2 for the DRAM module. We extend the Ruby module (GEMS) to simulate the cache swaps in detail in order to take into account their dynamic energy overhead.
We use a set of 20 multiprogrammed workloads built as random combinations of the 29 SPEC CPU 2006 applications [30], with no special distinction between integer and floating point programs. Each application appears on average 5.5 times with a standard deviation of 2.5. Programs were run on a real machine until completion with the reference inputs. Hardware counters were used to locate the end of the initialization phase. Every multiprogrammed mix was run for as many instructions as the longest initialization phase, and a checkpoint was created at this point. We then ran cycle-accurate simulations including 300 million cycles to warm up the memory hierarchy and 700 million cycles for data collection.
We also include a selection of shared-memory parallel applications from with a confidence level of 95% [34] . In other words, the number of samples is 
Evaluation
This section evaluates the effectiveness of the proposed BDOT management technique for LLC caches in terms of MPKI, adding up the misses in all LLC banks and dividing by the aggregated instruction count of all cores. Later, in Section 8, we analyze the impact on system performance, area, and energy.
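The aggregated metric defined above can be written down directly. The helper names and the sample counter values below are hypothetical; only the definition (total LLC misses over total instructions, scaled to thousands) comes from the text.

```python
from typing import Sequence


def llc_mpki(misses_per_bank: Sequence[int],
             instructions_per_core: Sequence[int]) -> float:
    """LLC misses per kilo instruction, aggregated over all banks and cores."""
    total_instructions = sum(instructions_per_core)
    if total_instructions == 0:
        raise ValueError("no instructions were executed")
    return 1000.0 * sum(misses_per_bank) / total_instructions


def relative_reduction(baseline: float, improved: float) -> float:
    """Percentage reduction of a metric with respect to a baseline value."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (baseline - improved) / baseline


# Hypothetical counters for a two-bank, two-core system.
mpki = llc_mpki([100, 300], [50_000, 150_000])
print(f"LLC MPKI: {mpki:.2f}")
```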
To assess the effectiveness of our proposals, we include several additional configurations. First, as an upper bound in performance, we include a cache built with unrealistically robust cells (Robust); i.e., cells that operate at ultra-low voltages with neither failures nor power or area overheads, which corresponds to a perfect, unattainable solution. Then, we also include block disabling (BD), as our proposal emerges from it. Finally, we add results for word disabling (WD) [10]. Word disabling is a more complex technique that combines consecutive faulty cache entries to recreate fully functional ones, at the cost of reducing the cache capacity. Section 9 presents a comprehensive discussion of this and other techniques versus our proposals.
In summary, we consider the following configurations:
• Robust: reference system; the LLC is built with unrealistically robust cells. All data are presented with respect to this system.
• BD: system implementing block disabling, as presented in Section 3, with NRU replacement.
• BDOT-NRU: system implementing block disabling with operational tags, as presented in Section 4, with NRU replacement.
• BDOT-NRR: system implementing BDOT with NRR replacement, as presented in Section 5.1.
• BDOT-NRR-FA: system implementing BDOT with fault-aware NRR replacement, as presented in Section 5.2.
• WD: system implementing word disabling with NRU replacement [10] .
As in the case of NRR, the NRU implementation also includes private copy protection. Our detailed results include multiprogrammed workloads (the
Multiprogrammed Workloads
Figure 6 shows the LLC MPKI results for the multiprogrammed workloads.
Parallel Workloads
Figure 7 shows the relative LLC MPKI for the parallel workloads, with respect to the baseline. As with multiprogrammed workloads, BDOT-NRR-FA has a lower average MPKI than BD and non-fault-aware implementations of BDOT. In particular, BDOT-NRR-FA improves MPKI with respect to BD by 5%, 5%, 9.6%, 19.2%, and 54.2% on average for C6, C5, C4, C3, and C2, respectively. Compared with the multiprogrammed workloads, the relative MPKI numbers shown in Figure 7 are larger, moving away from the Robust system to a greater extent for all cell types, even for the winning alternatives (WD and BDOT-NRR-FA). But it is worth noting that the absolute MPKI values for the parallel applications considered are low (Section 6), which makes the relative increases appear more substantial.
Upon closer examination of the results, we can make some interesting observations. Figure 8 shows the LLC MPKI analysis per application for the different cell types. BD is better than plain BDOTs (BDOT-NRU, BDOT-NRR) in C6-C3 cells (C3 in canneal is an exception), while in cell C2 the trend clearly reverses.
On the contrary, BDOT-NRR-FA is better than BD in most cases, being vips
Finally, the costly WD shows a similar tendency to that observed with multiprogrammed workloads, with a relatively constant performance independently of the cell type. In this case, BDOT-NRR-FA beats WD when using C6 or C5 (12.7% and 6.6% lower MPKI values, respectively), but it cannot reach WD performance for C4, C3, or C2 (5.5%, 12.4%, and 33.3% higher MPKI values, respectively).
System Impact
This section analyzes the impact of our proposals on the system in terms of performance, area, and energy consumption. As in the previous section, we present results relative to the Robust system and compare with the BD and WD mechanisms.
BDOT-NRR-FA shows a performance degradation with respect to the Robust reference system of 1.3%, 2%, 3.4%, 4.3%, and 6.9% for C6, C5, C4, C3, and C2, respectively, or, in other words, a performance improvement with respect to BD of 2%, 2.2%, 2.7%, 3.6%, and 13.1%.
better than BD on average, while in Figure 7, the average MPKI value with these techniques was higher than with BD. As we already mentioned, the LLC MPKI for the parallel applications in the baseline system is small (Section 6), and small MPKI increases with respect to this system appear relatively large in Figure 7.
Nevertheless, for C3, streamcluster has a dramatic speedup degradation with BD. This is due to the large number of back invalidations to L1 blocks to force directory inclusion (inclusion victims). Specifically, in this application, the number of invalidations to L1 blocks decreases by a factor of 20 when implementing BDOT. The MPKI numbers are similar, but the number of instructions executed differs considerably. For this application, we observe a performance improvement of 6.1% when using BDOT-NRU (6.2% for BDOT-NRR), with respect to BD.
On average, BDOT-NRR-FA shows a similar performance to BD for C6 and C5, where the performance degradation with respect to the reference system is 2.2% and 2.9%, respectively, and for C4, C3, and C2, the performance is better, by 1.8%, 7.1%, and 34.6%, respectively. BDOT-NRR-FA and WD have similar performance (within 1%), except in the case of C2, for which WD achieves a 3.1% better performance.
In summary, BDOT-NRR-FA is an excellent choice for caches with different numbers of defective entries, as it achieves as good performance as more complex fault-tolerant techniques without adding any extra storage overhead to the cache.
Larger SRAM cells are less likely to fail, but at the cost of larger areas and higher power consumption. Even the largest cell considered by Zhou's study (C6), which requires a 41.1% larger area than C2, is far from reaching fully functional performance: 40.1% of the cache entries are faulty at 0.5 V (Table 1).
Area and Energy
Our fault-aware mechanism has a minimal impact on area. Only two extra bits suffice to implement BDOT-NRR-FA: one bit marks entries as defective (as in BD), and the other one stores the replacement policy (i.e., NRR) information.
Thus, no extra storage overhead is added compared to the BD system.
Minimizing area helps to reduce energy in the LLC. Signals traveling smaller distances require less dynamic power for switching, and, most importantly, small cells consume less static power. To estimate the sub-threshold current, I_sub, causing the static consumption, we assume that I_sub is directly proportional to the transistor width of the cells considered, and estimate it with respect to C2 [4]. For the unrealistically robust cell, we assume that it is the same size as C2, but with a null probability of failure. Energy consumption also includes the dynamic overhead of LLC block swaps and L1 clean data eviction required by the fault-aware BDOT policy. Finally, we account for both the on-chip power and the off-chip DRAM power.
The energy results shown above do not consider any block power gating technique [38]. Assuming a more aggressive approach, where fine-grained block power gating is affordable [39], the benefits of BD-based techniques in terms of power and energy will be enhanced, as faulty entries do not consume static power during operation. Applying this technique, the EPI with BDOT-NRR-FA would be 6.2%, 6.7%, 7.2%, 6.3%, and 5.5% lower for the multiprogrammed workloads than the EPI values in Figure 10 with C6, C5, C4, C3, and C2 cells, respectively. The same tendency is observed in the parallel workload results.
Related Work
Conventional 6T SRAM cells fail to operate reliably in the near-threshold regime, as the ratio constraints for read stability and writability of transistors cannot be guaranteed, especially in view of V_th variations. Prior proposals to mitigate the impact of SRAM cell failures due to parameter variation at ultra-low voltages can be categorized into circuit and architectural solutions.
Circuit solutions include methods that improve the bit cell resilience by increasing its size [4], or by adding assist/spare circuitry [18, 3]. Increasing the cell size or the number of transistors per cell comes at the cost of significant increases in the SRAM area (lower density) and power consumption. Since the LLC accounts for much of the die size, increasing its area (e.g., ST SRAM cells [3] double the area of the SRAM structure) is not a design option.
structures or re-mapping mechanisms, only minor changes to the coherence protocol and replacement policy.
In the context of ultra-low voltages, Keramidas et al. use a PC-indexed spatial predictor to orchestrate the replacement decisions among fully and partially usable entries in first-level caches [46]. We base our allocation predictions on reuse patterns, which simplifies the hardware, and we do not consider the use of partially faulty entries.
Regarding the implementation of our techniques, it is worth referring to the work of Jaleel et al. [17] . In inclusive hierarchies, the private caches filter the temporal locality and hot blocks (i.e., blocks being actively used by the techniques such as RRIP [47] . Our proposal uses NRR as the base replacement policy.
Summary and Conclusions
Voltage reduction has been the primary driver to reduce power during recent 
The reduction in associativity and capacity experienced by inclusive LLCs extended with BD has two specific drawbacks in multicore systems. First, the number of inclusion victims in private L1 caches increases. Second, the MPKI values also grow, increasing LLC miss latency and main memory energy consumption.
To cope with the first problem, we propose Block Disabling with Operational Tags (BDOT), which uses robust cells to implement the LLC tag array. BDOT enables some cache blocks to be only in private levels by simply tracking their tags (T entries), and extends the existing cache-to-cache coherence service to clean blocks. Thus, with regard to inclusion victims, the LLC associativity is fully restored. BDOT requires a small amount of extra control, and it adds no storage overhead to BD (the bit that marks operative entries suffices to distinguish between LLC T and D entries). Any replacement algorithm may work with BDOT, and we have tested NRU and NRR, two low-cost state-of-the-art proposals for LLCs.
After the last copy L1 eviction of a block tracked by a T entry, a future reference to this block will involve an off-chip access, even though we know that reuse chances are high. Hence, we tackle the second problem from the key observation that we can preserve the data cached on-chip by exchanging the valuable, just evicted T entry block (promotion), for an L1-present D entry block 
