Energy efficiency has now become the primary obstacle in scaling the performance of all classes of computing systems. Low-voltage computing, specifically, near-threshold voltage computing (NTC), which involves operating the transistor very close to and yet above its threshold voltage, holds the promise of providing many-fold improvement in energy efficiency. However, use of NTC also presents several challenges such as increased parametric variation, failure rate, and performance loss. This article surveys several recent techniques that aim to offset these challenges for fully leveraging the potential of NTC. By classifying these techniques along several dimensions, we also highlight their similarities and differences. It is hoped that this article will provide insights into state-of-the-art NTC techniques to researchers and system designers and inspire further research in this field. 
INTRODUCTION
Recent trends of process technology scaling have made energy efficiency the first-order design constraint in all computing systems ranging from portable embedded systems to large-scale data centers and supercomputers. The number of just mobile systems has now exceeded the population of the earth [Mittal 2014c ], which makes their total power consumption very large. Similarly, the electricity consumption of U.S. data centers has grown from 61 billion kWh (kilowatt hours) in 2006 [Mittal 2014a ] to 91 billion kWh in 2013 and is likely to grow to 140 billion kWh in 2020 [NRDC 2013] . Just the top two supercomputers in the November 2014 version of the Top500 list of supercomputers consume a total of 26 megawatts of power [Top500 2014] , which is sufficient to fulfill the demands of a city of several thousand residents. The increased power dissipation, however, greatly increases the operational costs and complexity of these systems. The worldwide expenditure on enterprise power supply and cooling has reached more than $30 billion [Patel and Ranganathan 2006] . Very large levels of power dissipation cause reliability issues and necessitate advanced cooling techniques (e.g., liquid cooling, submersion cooling, etc.), which may be extremely costly and even infeasible for most computing systems. While the number of transistors on a chip increases at an exponential rate (e.g., 10 billion transistors on a recent chip [Oracle 2014 ], up from just 2.3K transistors on a processor in 1971), the effectiveness of power and thermal management solutions does not scale as well. These trends and requirements have forced designers to pursue aggressive energy optimization techniques.
Since supply voltage has a strong influence on both static and dynamic energy, low-voltage operation presents an attractive avenue for achieving high energy efficiency. Specifically, near-threshold voltage (NTV) operation, which involves scaling of supply voltage very close to and yet above the threshold voltage (V th ) of a transistor, has the potential to provide a 5 to 10× improvement in energy efficiency [Khare and Jain 2013]. While standard DVFS (dynamic voltage/frequency scaling) typically reduces the supply voltage to no lower than 70% of the nominal voltage level, NTV operation scales the voltage to nearly 25% to 35% of the nominal level, which is close to V th . NTV operation aims to achieve a fine balance between performance and energy efficiency, since lowering the voltage further (subthreshold operation) leads to extremely slow transistors with a high fault rate, while increasing the voltage diminishes the energy savings rapidly (refer to Section 2 for more details). In fact, at NTV, even small changes in voltage (e.g., 100 to 200mV) lead to large changes in frequency (e.g., 400 to 800MHz). These tradeoffs are also illustrated in Figure 1. Given the tradeoff between performance, energy, and reliability factors at NTV [Khare and Jain 2013; Pu et al. 2010], it is clear that intelligent techniques are essential to leverage the full potential of NTC and ensure its deployment in mainstream processors.
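As a rough illustration of why supply voltage has such a strong influence on energy, the following sketch uses the standard first-order model in which dynamic switching energy per operation scales with C·V². It is only a sketch under stated assumptions: leakage energy and the longer delay at low voltage are ignored, and the function name and the normalized nominal voltage of 1.0 are hypothetical choices made for illustration.

```python
def relative_dynamic_energy(v_scaled: float, v_nominal: float) -> float:
    """Dynamic energy per operation relative to nominal, assuming E ~ C * V^2.

    Assumption: capacitance C is unchanged by voltage scaling, and leakage
    energy is ignored, so this gives only an optimistic first-order estimate.
    """
    if v_scaled <= 0 or v_nominal <= 0:
        raise ValueError("voltages must be positive")
    return (v_scaled / v_nominal) ** 2


if __name__ == "__main__":
    # Compare a DVFS-like floor (70%) with NTV-like levels (35% and 25%).
    for fraction in (1.0, 0.7, 0.35, 0.25):
        rel = relative_dynamic_energy(fraction, 1.0)
        print(f"V = {fraction:.2f} x nominal -> dynamic energy x {rel:.4f}")
```

Because leakage energy per operation grows as circuits slow down, the net savings are smaller than this quadratic estimate suggests, which is consistent with the optimum lying near rather than below the threshold voltage.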
Contribution and article organization:
In this article, we survey several recent techniques that utilize NTC and/or address challenges associated with it. Figure 2 shows the overall organization of this article. We first discuss the motivation behind the use of NTC (Section 2.1) and highlight important trends and challenges that are worthy of future investigation (Section 2.2). After this, we summarize some key ideas that are common to many NTC techniques (Section 3 and Table I ). We then discuss the design and management approaches used for NTC (Section 4 and Table II ). Further, we organize the techniques based on their target system component (Section 5 and Table III ) and the objective (e.g., performance, energy efficiency) they seek to optimize (Section 6 and Table IV) .
In these sections, we classify the techniques on several important parameters to underscore their similarities and differences. Note that the techniques presented in these sections are tightly interconnected, and although each has been discussed under a single category, many techniques fall in multiple categories. Since different techniques have been evaluated using different methodologies, we mainly focus on their key insights and do not present their quantitative results. In this article, we present architecture and system-level techniques, in contrast with a few previous works that review circuit/device-level low-voltage techniques [Gupta et al. 2010]. Finally, Section 7 presents the conclusion and future challenges. The aim of this article is to provide researchers insights into the workings of NTC techniques and motivate them to create breakthrough inventions for enabling their adoption in all classes of computing systems. We hope that this article will be useful for chip designers, computer architects, system developers, and other researchers.
BACKGROUND AND MOTIVATION
In this section, we briefly discuss the opportunities and obstacles associated with NTC. deduce the optimal operating voltage under the voltage-scaling scenario and show that it lies in the NTV region. They note that the energy savings provided by voltage scaling allow aggressive task parallelization; however, the maximum energy savings achieved by such an approach are limited due to several factors, such as architectural overheads (e.g., coherence and memory energy), nonideal parallelization, and so forth. They study the limit of voltage scaling efficiency and show that under realistic architecture/technology/application scenarios (e.g., intercore communication, parallelization efficiency, leakage power consumption, etc.), the optimal operating voltage lies in the near-threshold region, which is approximately 200 to 400mV above the threshold voltage. They also show that this voltage range holds good across six process technology generations and across transistor threshold voltage selection.
Scope for Near-Threshold Computing
Near-threshold computing is a promising approach for a variety of reasons, as we discuss next.
2.1.1. Varying Usage Pattern. To fulfill the demands of peak performance and service level agreements (SLAs), computing resources are typically overprovisioned; however, the average utilization of these resources remains low [Ahuja et al. 2012; Mittal 2014a; Mittal 2014b]. Moreover, applications or application phases with limited parallelism or computational requirements do not utilize the processor resources fully [Mittal and Vetter 2015b]. Use of NTC in selected components or application phases can help in avoiding such inefficiencies. For example, such techniques can allow trading off cache capacity for performance [Ghasemi et al. 2011] or for energy or reliability. In fact, the execution profile of computing systems such as data centers or battery-powered embedded systems can be broadly divided into two types: a high-performance mode, relatively short in duration, where high or moderate voltage is used to service high computational demands, and a low-power mode that occupies a larger fraction of execution time, where reduced computational demands allow use of low voltage for saving energy [Mittal 2014a].
2.1.2. Limited Power Budgets. Many computing systems such as body sensor networks, biomedical systems, and wireless systems have a strict power budget of milliwatts and even microwatts [Mittal 2014c]. On the other end of the spectrum, Exascale machines, which will perform 10^18 operations per second, have a power budget of 20 megawatts. Clearly, low-voltage operation is not merely attractive but even essential in most systems to meet their power budgets.
2.1.3. Limitation of Other Power Management Strategies. While low-voltage computing has its own limitations (as discussed later), alternative power management strategies are likely to be less effective or present even bigger challenges. Some techniques such as use of specialized processors or nonvolatile memory are likely to require significant architectural redesign and investments [Mittal and Vetter 2015b] . Liquid/advanced cooling strategies are clearly infeasible for portable systems (e.g., laptops or cell phones) and may be used only in remotely operated servers. Energy-aware task-scheduling approaches, workload consolidation, data compression, and so forth [Mittal 2014a ] may be useful only in limited domains and yet provide small energy savings. Compared to these, voltage scaling presents less complexity and may be readily deployed in a wide range of systems.
Challenges in Using Near-Threshold Computing
Despite its promises, use of near-threshold computing presents several challenges, as we summarize next.
2.2.1. Increased Vulnerability to Parametric Variation. Reduction in voltage aggravates the effect of process, voltage, and temperature (PVT) variation , and NTC techniques must account for these effects (see Table IV ). PVT variation leads to within-die and die-to-die differences in several crucial transistor parameters such as threshold voltage and effective gate length. The increased nondeterminism has a negative effect on yield, performance, and power and thermal management. Reduction in voltage makes the cells more susceptible to soft errors since the charge required to flip the value is also reduced [Mittal and Vetter 2015a] . These faults need to be detected using postmanufacturing and boot time tests, BISTs (built-in self-tests), and error-correcting code (ECC) or similar schemes, which incur time and cost overheads.
2.2.2. Higher Vulnerability of Memory Structures. Low-voltage operation affects memory structures more than the logic elements [Bacha and Teodorescu 2014]. This is due to the fact that the memory structures are optimized for area and have lower voltage margins, which can be more easily violated at low voltage due to parametric variations. Thus, SRAM structures show a steep rise in failure rate at low voltage, and hence, they also limit the extent of voltage scaling in the processor. To avoid this, different voltage islands can be used for logic and memory [Khare and Jain 2013], which allows for operating the core at lower voltage and cache at higher voltage for achieving both reliability and energy savings. This, however, requires use of voltage-level converters, which consume chip area, have only suboptimal power efficiency, and may require hundreds of cycles to change the voltage.
2.2.3. Challenges in Multicore Processors. For multi-/many-core processors, parametric variation may manifest as core-to-core (C2C) variation, and these effects are exacerbated by low-voltage operation. This may render conventional core-/application-unaware management policies (e.g., cache replacement, scheduling, etc.) ineffective. Similarly, variation-induced timing errors in a pipeline are mitigated by a flush-rollback process, and its overhead in an SIMD (single-instruction multiple-data) pipeline may be much larger than that in a scalar pipeline.
2.2.4. Limitation of Error-Correcting Codes. Due to the latency-optimized design of first-level caches, use of ECC in them may lead to large performance loss, and hence, special approaches are required for using ECC in them. Also, the superlinear increase in the failure rate on lowering the voltage may surpass the correction and even detection capability of existing ECCs in lower-level caches. To avoid this, stronger ECCs may need to be used, which incurs significant hardware and runtime overheads (refer to Table II).
2.2.5. Limitation of Block Disabling Schemes. When faults surpass the correction capability of ECC logic, designers typically resort to disabling the faulty structures to continue execution (see Table II ). Disabling the structures (e.g., cache blocks), however, leads to rapid capacity degradation and yet only provides a short-term solution. Also, block disabling increases costly off-chip accesses and requires careful management of dirty data. Further, by virtue of leaving a cache with variable associativity, it affects the performance predictability.
The granularity at which blocks are disabled has a crucial impact on performance; for example, coarse-grained disabling requires smaller metadata overhead (e.g., fault map) but degrades capacity quickly, and the opposite is true for fine-grained disabling. Further, modern processors feature multimegabyte last-level caches, and hence, testing millions of blocks/subblocks at different voltages repeatedly during runtime or at each reboot becomes extremely cumbersome.
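The granularity tradeoff described above can be made concrete with a small back-of-the-envelope model. The sketch below assumes independent bit faults with a uniform per-bit failure probability and one disable bit per unit in the fault map; both are simplifying assumptions made here for illustration, and the function names are hypothetical.

```python
def usable_fraction(unit_bits: int, p_bit_fail: float) -> float:
    """Expected fraction of units that stay enabled when any unit containing
    at least one faulty bit is disabled (independent bit faults assumed)."""
    if unit_bits <= 0 or not 0.0 <= p_bit_fail <= 1.0:
        raise ValueError("invalid unit size or failure probability")
    return (1.0 - p_bit_fail) ** unit_bits


def fault_map_bits(num_blocks: int, block_bits: int, unit_bits: int) -> int:
    """Fault-map size in bits, assuming one disable bit per unit."""
    if unit_bits <= 0 or block_bits % unit_bits != 0:
        raise ValueError("block size must be a multiple of the unit size")
    return num_blocks * (block_bits // unit_bits)


if __name__ == "__main__":
    p = 0.001  # hypothetical per-bit failure probability at low voltage
    for unit in (512, 32):  # whole 64-byte block vs. 32-bit word
        print(unit, usable_fraction(unit, p), fault_map_bits(1024, 512, unit))
```

Under these assumptions and a per-bit failure probability of 0.001, disabling whole 512-bit blocks retains roughly 60% of the capacity, whereas disabling 32-bit words retains roughly 97% at the cost of a fault map that is 16 times larger.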
2.2.6. Overhead of NT-Tolerant Components. Compared to conventional circuit designs (e.g., 6T SRAM), use of NT-tolerant circuits such as 8T [Chang et al. 2005], 10T [Calhoun and Chandrakasan 2006], or Schmitt trigger (ST)-based SRAM [Kulkarni et al. 2007] can lead to an order-of-magnitude reduction in failure probability [Ghasemi et al. 2011]. For this reason, these cells have been used in commercial processors; for example, Intel's 45nm Nehalem processor uses 8T SRAM cells in L1 cache [Kumar and Hinton 2009]. However, these cells incur significant area overhead (e.g., 100% for ST SRAM) and have higher access latency and leakage power [Pu et al. 2010]. Since caches already occupy between 25% and 50% of the chip area in modern processors [Mittal 2014b], increasing their area further may require reduction in area budget for cores. Conversely, for a fixed cache area, use of larger cells decreases the cache capacity. To offset the area overhead of NT-tolerant cells, some techniques use heterogeneous (e.g., 6T and 8T) cache designs (refer to Table II) and activate only NT-tolerant (e.g., 8T) cells in low-voltage mode, which also degrades their capacity. Clearly, in either case, use of NT-tolerant components may reduce the maximum supportable cores on a chip.
2.2.7. Redesign and Careful Evaluation Required at NTV. Due to the unique challenges presented at NTV, partial retrofitting of existing cache management schemes or evaluation approaches for NTV is likely to be insufficient. For example, due to increased latency and reduced capacity at NTV, a cache management scheme (e.g., for placement, movement, and replication of data) that is optimized for nominal voltage may not perform optimally at NTV. This is due to the fact that the reduced capacity affects replication decisions and presents a tradeoff between energy loss due to increased off-chip accesses and energy savings due to NTV operation.
Similarly, Pu et al. [2010] note that research works on NTC may overlook crucial assumptions, and due to this, the claims of energy reduction obtained by these works may be overestimations. For example, blindly comparing results across different technology generations or using a nonstandard definition of threshold voltage can lead to exaggerated energy-saving claims. Also, NTV operation leads to severe throughput degradation, which may not be easily offset by deep pipelining and parallelism, since the number of parallel units required increases sharply at low voltage. This, however, demands additional area, which increases layout complexity and yield loss. The maximum performance improvement achievable is also limited by Amdahl's law. Further, assuming that the supply voltage and energy of memory can scale as well as that of standard logic cells can lead to inaccurate claims. Furthermore, the temperature behavior of circuits at NTV is totally different from that at nominal voltage, and failing to separately characterize the temperature behavior at different voltages can lead to oversight.
Thus, a careful design is definitely required to balance the advantages and disadvantages of low-voltage operation. This article surveys many intelligent techniques that aim to fulfill this need.
KEY IDEAS OF NEAR-THRESHOLD COMPUTING TECHNIQUES
While different NTC techniques vary in their scope and features, several essential ideas are common to them. In this section, we review these key ideas to provide insights to the reader and we show their use in NTC techniques in Sections 4, 5, and 6. Table I summarizes these ideas; we now discuss them.
(1) Since regular (i.e., 6T) and NT-tolerant (e.g., 8T) cells have different performance and reliability characteristics, heterogeneous-cell caches migrate data between regular and NT-tolerant ways or access those ways sequentially (instead of "in parallel") to achieve different performance, energy, and reliability tradeoffs.
(2) The blocks that show faults due to NTV operation are disabled. The disabling can be done at fine granularity instead of coarse granularity; for example, on a fault in a cache word, only that word needs to be disabled instead of disabling the entire cache block [Choi et al. 2011; Mahmood and Kim 2011]. Then, multiple blocks that do not have errors in the same position can be paired to achieve a single working block. Also, partially faulty or nonfaulty blocks can be used to store error correction information for faulty blocks, or address remapping can be used to avoid accesses to faulty ways [Choi et al. 2011].
(3) Different types of data or operations have different criticality. For example, dirty data are more critical than clean data since an error in clean data of a cache can be corrected by fetching them from main memory, but an error in dirty data cannot be corrected as this is the only copy of the data in the memory hierarchy. Based on this, more critical portions can be allocated to memory or a core with higher reliability (refer to Table I).
(4) Some techniques dynamically adjust the voltage value to find just the right voltage that strikes a balance between energy savings and low error rate [Bacha and Teodorescu 2014, 2013]. Other techniques use multiple voltage domains for different cores, or for memory cells and logic, to suit their requirements.
(5) With decreasing voltage, the failure rate increases, and hence, higher error protection is required at lower voltage regions. To achieve this, stronger ECC [Maric et al. 2013b] or a higher number of redundant data copies [Yalcin et al. 2014b] may be used at low voltage. Also, in low-voltage regions, only NT-tolerant cells may be kept activated and normal cells may be deactivated [Ghasemi et al. 2011] to avoid a high error rate.
(6) Under voltage scaling, different cache lines show different numbers of failures. To achieve a balance between correction overhead and error rate, stronger ECC may be used for lines with a larger number of failures, weaker ECC may be used for lines with few failures, and line disabling may be used for lines that show a higher number of failures than can be corrected. Alternatively, lines with many failures may be disabled and those with few failures may be corrected by ECC, or lines showing even a single failure may be disabled to avoid ECC overhead.
Table II (fragment). [...] NT-tolerant: Maric et al. [2013b]. Cache with only NT-tolerant cells: Khare and Jain [2013]; Kumar and Hinton [2009]. NT-tolerant cells in tag array. Management/optimization approaches, disabling faulty or other specific cells: Alameldeen et al. [2011]; BanaiyanMofrad et al. [2011, 2013].
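The block-pairing idea in item (2) can be sketched as follows. This is a minimal illustration rather than any surveyed design: it assumes a per-word fault map is already available for each physical block, and the function names and the list-based block representation are hypothetical.

```python
from typing import List, Optional, Sequence


def can_pair(faults_a: Sequence[bool], faults_b: Sequence[bool]) -> bool:
    """Two blocks can be paired if no word position is faulty in both."""
    if len(faults_a) != len(faults_b):
        raise ValueError("fault maps must cover the same number of words")
    return all(not (fa and fb) for fa, fb in zip(faults_a, faults_b))


def build_logical_block(
    block_a: Sequence[str], faults_a: Sequence[bool],
    block_b: Sequence[str], faults_b: Sequence[bool],
) -> Optional[List[str]]:
    """Form one working logical block from two partially faulty blocks.

    For each word position, the word is taken from block A unless A is
    faulty there, in which case it is taken from block B. Returns None
    when some position is faulty in both blocks.
    """
    if not (len(block_a) == len(faults_a) == len(block_b) == len(faults_b)):
        raise ValueError("blocks and fault maps must have the same length")
    if not can_pair(faults_a, faults_b):
        return None
    return [wa if not fa else wb
            for wa, fa, wb in zip(block_a, faults_a, block_b)]
```

For example, pairing a block that is faulty in words 0 and 3 with a block that is faulty only in word 2 yields a fully working logical block, at the cost of using two physical blocks to store one block of data.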
DESIGN AND MANAGEMENT APPROACHES FOR NEAR-THRESHOLD COMPUTING
In this section, we discuss several techniques for designing and managing the processor under the near-threshold operation regimen. Table II provides an overview of these techniques; we now discuss them briefly.
Using Heterogeneous Cell Caches
Several researchers use both conventional (e.g., 6T) and NT-tolerant cells (refer to Section 2.2.6) for designing cache/memory to bring the best of both together. We now discuss some of these techniques. Note that since the tag array consumes much less area than the data array, many research works that use ECC-, replication-, or block-disabling-based approaches for improving reliability of the data array assume that the tag array is designed using NT-tolerant cells (see Table II).
present a heterogeneous SRAM cache designed using both 6T and 10T cells. For example, an eight-way cache can be designed using all 6T cells, or with six 6T and two 10T ways, or with seven 6T ways and one 10T way, for exercising different area-capacity tradeoffs. This is shown in Figure 3. For heterogeneous caches, they consider different cache access schemes. In the "parallel" scheme, all ways are accessed in parallel, and in the "sequential" scheme, the 6T ways are accessed first, and in case of a miss in 6T ways, the 10T ways are accessed. The "swap" scheme works the same as the "sequential" one, except that on a hit in 10T ways, the data is swapped with the LRU line in the 6T ways. This scheme aims to bring hot data in 6T ways to serve most hits from those ways. For applications with limited locality, the swap mechanism can lead to a large number of data movement operations. To avoid this, they propose an adaptive technique that tracks the hits across both 6T and 10T ways and uses this to activate/deactivate the swap operations. They show that their technique provides better performance and energy efficiency than the other three schemes mentioned previously.
present a cache design where large and energy-hungry (strong) SRAM cells are replaced with energy-efficient and smaller (weak) SRAM cells in certain cache sets.
For the heterogeneous-cell cache used by , they replace 6T S and 10T S cells with 6T W and 10T W cells, where subscripts S and W refer to strong and weak cells, respectively. To maintain the same reliability levels despite potentially faulty cache lines, these sets are enhanced with extra cache lines in an additional structure such as a victim cache. By virtue of this, the number of fault-free lines in each set becomes the same as that in the baseline, and thus, their technique provides strong timing guarantees required for worst-case execution time (WCET) estimation. Further, use of weaker cells helps in achieving significant energy savings.
Maric et al. [2013b] present a cache architecture that aims to achieve low energy and high reliability. Assuming the 6T+10T heterogeneous-cell cache used by as the baseline, they propose replacing energy-hungry 10T cells with smaller and more energy-efficient 8T cells. Depending on different workload requirements, the cache can work in two modes, high-performance (HP) and ultra-low energy (ULE), where HP and ULE are characterized by high/moderate and near-threshold voltages, respectively. Compared to a baseline with no error coding, they propose using SECDED (single error correction and double error detection) in ULE mode and no error coding in HP mode (i.e., turning off SECDED), since 8T cells are less reliable than 10T cells at near-threshold voltages, which demands provision of stronger reliability. This is illustrated in Figure 4. Thus, by intelligently adapting the error codes, their technique guarantees the same performance and reliability levels as the baseline while saving energy by virtue of using energy-efficient 8T cells.
Ghasemi et al. [2011] present a low-voltage LLC (last-level cache) architecture that exploits the DVFS characteristics of workloads to achieve both high performance and low minimum supply voltage. They note that under the DVFS scheme, when the load level is higher, the processor spends a large fraction of its runtime in high frequency/voltage states. They design an LLC with a spectrum of cell sizes. At low voltages, only large cells are used to achieve low failure rates, which provides energy savings, and the performance penalty of having reduced LLC capacity is small since the processor runs at lower frequencies at lower voltages. To achieve high performance, voltage is changed to be high enough so that failure rate of even small cells is in an acceptable range, which provides sufficient LLC capacity. With decreasing operating voltage, subsets of cells are disabled in order of size beginning with the smallest cell (refer to Figure 4). Thus, without using large cells for an entire LLC, their technique facilitates low-voltage operation by using dynamic adaptation.
present an ultra-low-power multicore architecture for an eHealth monitoring system that requires collecting biomedical signals using highly parallel computations at low voltage. They note that the memory requirement of the monitoring system varies significantly in different phases. In the sensing phase, which constitutes the larger fraction of overall time, memory needs to be just sufficient for storing the sampled data, while larger memory is required in the compression phase for computation and temporary storage. They propose a hybrid memory architecture consisting of 6T and 8T SRAM banks. In the sensing phase, voltage is switched to 600mV. For providing reliable operation, only 8T banks are activated and 6T banks remain idle in the data-retentive mode (refer to Figure 4). In the compression phase, voltage is switched to 1.2V and both 6T and 8T banks are activated to provide higher performance.
present a heterogeneous-cell cache architecture that provides an energy advantage of low-voltage operation along with the performance benefits of a large cache. The two types of cells used show different robustness to failures at low voltages, and hence, they are used differently. Only clean data are stored in nonrobust cells, which are protected using simple error detection mechanisms, since in case of an error, the correct data can be obtained from a lower-level cache or memory. Dirty data are stored only in robust cells (refer to Figure 10 in Section 6.5) and the replacement policy is modified to ensure this. The write misses are allocated to robust lines, and read misses are allocated to nonrobust lines. The energy savings provided by low-voltage operation enables utilization of more active cores, which makes the lowest-voltage operating point the one having the highest performance.
note that due to different activity factors and leakage rates for memory cells and logic, operating them in the same voltage region leads to suboptimal energy efficiency. Hence, they propose to operate them in different voltage domains. They explore cluster organization where K slower cores are connected to the same faster cache, which serves these K cores by running K times faster than the cores. This is achieved by assigning suitable V dd and V th to cores and caches. With a rising value of K, cache contention and access energy increase, although due to increasing sharing between cores, communication overhead is also reduced. Due to this tradeoff, a value of K = 2 (i.e., two cores per cluster) is found to provide the highest energy efficiency for multithreaded benchmarks. They also explore the effect of separately clustering instruction and data (I and D) L1 caches on different applications. They observe that since I-cache has a high access rate and low miss rate, keeping a per-core private I-cache provides larger energy savings.
By comparison, due to a lower access rate, higher miss rate, and larger data sharing, D-cache is more suited for clustering.
propose a technique that works by reconfiguring its internal organization to tolerate a large number of SRAM errors that arise in the NTV region. Their technique partitions the cache into multiple autonomous islands of various sizes that function correctly without borrowing redundancy from each other. Each island is a group of physical cache word lines that include spare word lines divided into multiple redundancy units. These spare units are used to achieve fault-free operation of other word lines in the same group. They use a clustering algorithm that partitions the cache into the least number of islands such that the number of spare lines required for ensuring fault-free operation is minimized, since spare lines do not contribute to useful cache capacity. By virtue of not rigidly binding data and redundancy (unlike other techniques such as ), their technique reduces the overhead of redundancy. In high-performance mode, their technique is disabled to avoid losing cache capacity. They show that their fault-tolerant architecture allows the cache to operate at very low voltages.
propose a technique that works by exercising a tradeoff between the latency and capacity of L1 caches under different error rates (due to voltage scaling). They note that in case of high error rates, incurring extra latency to recover and utilize the additional L1 cache capacity is worthwhile, while in case of low error rates, it is better to avoid incurring the latency overhead of error correction for gaining additional L1 cache capacity. Based on this, they propose a private L1 cache design that works in two modes, "correction and disable," where single-bit errors are corrected and multibit errors lead to disabling of the cache line, and "line disable," where a cache line with either single or multibit errors is disabled. This is illustrated in Figure 5.
The first mode incurs additional hit latency to recover cache capacity, and the second mode optimizes hit latency at the cost of cache capacity. For each application, the per-core L1 cache operates in either of these two modes based on the L1 eviction rate and the average memory latency, such that overall performance is maximized. They show that their adaptive technique performs better than using either mode alone.
propose two techniques to enable cache operation at ultra-low voltages. Their first technique, called word-disable, disables 32-bit words that contain one or more defective bits. By combining nonfailing words in two consecutive ways, one logical line is formed and the position of failing/nonfailing words is stored in the tag. Their second technique, called bit-fix, uses a quarter of the cache ways to store both the location and correct value of defective bits in other ways. The limitation of these techniques is that they reduce the cache size and associativity by 50% and 25%, respectively. Also, they suffer large performance loss with an increasing failure rate.
note that disabling faulty storage in cache causes core-to-core variation and performance unpredictability since the application performance may vary depending on the faulty bit location and the core to which it is scheduled. Instead of disabling full cache lines, their technique disables only faulty subblocks (e.g., one cache block has four subblocks). To ensure that applications are affected by the faulty subblocks in a similar manner regardless of the location of such failures, their technique uses dynamic address remapping. For this, addresses are remapped dynamically in round-robin fashion, which ensures that each address is mapped to different cache regions over different time periods. They show that their technique provides better performance and smaller performance variability compared to other techniques (e.g., ).
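The round-robin remapping idea described above can be illustrated with a very small sketch. It assumes, purely for illustration, that remapping is a rotation of the set index by an epoch counter that is advanced periodically; the surveyed technique may realize the remapping differently, and the function name and parameters are hypothetical.

```python
def remapped_set(set_index: int, epoch: int, num_sets: int) -> int:
    """Rotate the original set index by the current epoch (assumed scheme).

    Over num_sets consecutive epochs, a given address visits every set
    exactly once, so no address is permanently tied to a faulty region.
    """
    if num_sets <= 0:
        raise ValueError("num_sets must be positive")
    if not 0 <= set_index < num_sets or epoch < 0:
        raise ValueError("set_index out of range or negative epoch")
    return (set_index + epoch) % num_sets


if __name__ == "__main__":
    # An address that originally maps to set 3 of an 8-set cache.
    print([remapped_set(3, epoch, 8) for epoch in range(8)])
```

The point of such a rotation is that the penalty of a faulty subblock is spread across addresses over time instead of always falling on the same data.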
Architectural Redesign for NTC

Using Block Disabling Approach
Choi et al. [2011] note that in the subblock disabling technique, accesses to disabled faulty subblocks lead to misses, which harms performance. To avoid this, they propose a technique that aims to match cache access behavior and error patterns. While a cache block resides in L1 cache, their technique records its access pattern at word granularity. When it is evicted from L1 to L2 cache, the access pattern is written to L2 cache. Later, when the cache block is fetched from L2 to L1 cache, the access pattern is also fetched. At this time, the error pattern of candidate cache block locations is compared with the access pattern of fetched data to select the most compatible cache resource, such that the words that were accessed are stored in nonfaulty locations. To increase the chances of matching, they also remap data words within the cache line. In the best case where the access and error patterns are always matched, their technique can almost completely alleviate the performance loss due to low-voltage operation.
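The matching step can be sketched as a comparison of bit masks. The following is a minimal sketch, assuming that access patterns and error patterns are both encoded as per-word bit masks (bit i set means word i was accessed, or word i is faulty); the encoding, the tie-breaking rule, and the function names are assumptions made here, and the within-line word remapping is not modeled.

```python
from typing import Sequence


def conflicts(access_mask: int, fault_mask: int) -> int:
    """Number of previously accessed words that would land on faulty words."""
    return bin(access_mask & fault_mask).count("1")


def pick_location(access_mask: int, candidate_fault_masks: Sequence[int]) -> int:
    """Return the index of the candidate location with the fewest conflicts.

    Ties are broken in favor of the lowest index (an arbitrary choice).
    """
    if not candidate_fault_masks:
        raise ValueError("at least one candidate location is required")
    return min(
        range(len(candidate_fault_masks)),
        key=lambda i: conflicts(access_mask, candidate_fault_masks[i]),
    )
```

For instance, a block whose words 0 and 1 were accessed (mask 0b0011) is best placed in a location whose faults lie in words 2 and 3 (mask 0b1100), since none of the accessed words then falls on a faulty word.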
Using Replication Approach
Chakraborty et al. [2010] present a technique that ensures reliable cache operation under low voltage by maintaining multiple copies of every data item. Their technique maintains two copies of clean data and three copies of dirty data in the same set, as shown in Figure 6. On access to clean data, both copies are accessed and compared. A mismatch indicates an error, and in this case, a correct value is retrieved from the lower-level cache/memory. On access to dirty data, all three values are accessed and compared and the correct value is obtained using majority voting, since the probability of error in all three copies is negligible. Compared to boot-time detection, the dynamic detection of errors as used by their technique has greater effectiveness since the SRAM errors depend on many runtime factors such as temperature, access frequency, position of hotspots, and so forth. Maintaining multiple copies of data, however, significantly degrades the cache capacity, and hence, their technique is useful only for those applications whose working sets are much smaller than the cache size, such as embedded applications. Also, accessing and comparing multiple copies on each access wastes energy.
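The two read paths can be summarized in a short sketch. The function names are hypothetical, `fetch_from_lower_level` stands in for a refill from the next cache level or memory, and bitwise voting is one assumed way of realizing the majority vote.

```python
def read_clean(copy1, copy2, fetch_from_lower_level):
    """Clean data: two copies are compared; a mismatch signals an error
    and the value is re-fetched from the lower-level cache/memory."""
    if copy1 == copy2:
        return copy1
    return fetch_from_lower_level()


def read_dirty(copy1, copy2, copy3):
    """Dirty data: no backup exists below, so three copies are kept and
    the value is recovered by bitwise majority voting."""
    return (copy1 & copy2) | (copy1 & copy3) | (copy2 & copy3)
```

The sketch also makes the cost visible: every read touches two or three copies, which is the energy overhead noted above.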
[Figure: replication example. Panel labels: made 2 copies each of b, e, f, g; evicted g and made 3 copies of e; made 3 copies of f. The technique of Yalcin et al. [2014b] also keeps multiple copies, but the number of copies maintained is determined by the voltage value and not by whether the data are clean or dirty.]

Yalcin et al. [2014b] present a flexible cache architecture that uses replication to provide different degrees of fault tolerance depending on different fault rates. To increase the error correction capability of replication schemes, each cache line is divided into multiple partitions at the desired granularity (e.g., word, byte, etc.), which are protected by parity. At nominal voltage, only a single copy of each data item is kept. When V dd is medium-low, two copies of data are kept. On a read access, parity-protected partitions of both copies are compared. In case of mismatch, the parity of each partition is computed and the partition with the correct parity is taken as the useful data. At near-threshold voltage, their technique maintains three copies of each data item to tolerate higher fault rates. This is illustrated in Figure 4. On a read access, bitwise majority voting is used to obtain the correct data. If parity of this data is incorrect, the parities of all three partitions are checked and the partition with the correct parity is taken as the useful data. If even triplicating data leads to an uncorrectable partition in the cache line, they use a partition-fix mechanism, which is similar to the bit-fix mechanism. In this mechanism, a quarter of the cache ways are used to store locations and correct values of defective partitions.
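The two-copy and three-copy read paths can be sketched for a single parity-protected partition. This is a minimal illustration under stated assumptions: each copy is a `(data, stored_parity)` pair, even parity over the data bits is used, and the stored parity bits are voted together with the data in the three-copy case. None of these details are confirmed by the description above.

```python
def parity(value):
    """Parity bit of a non-negative integer (1 if the popcount is odd)."""
    return bin(value).count("1") & 1


def read_two_copies(p1, p2):
    """Medium-low V dd: compare two copies of a partition; on a mismatch,
    return the copy whose data agrees with its stored parity."""
    (d1, s1), (d2, s2) = p1, p2
    if d1 == d2:
        return d1
    if parity(d1) == s1:
        return d1
    if parity(d2) == s2:
        return d2
    return None  # uncorrectable at this level


def read_three_copies(p1, p2, p3):
    """Near-threshold V dd: bitwise majority vote, then a parity check;
    fall back to any single copy with correct parity."""
    (d1, s1), (d2, s2), (d3, s3) = p1, p2, p3
    voted = (d1 & d2) | (d1 & d3) | (d2 & d3)
    voted_parity = (s1 & s2) | (s1 & s3) | (s2 & s3)
    if parity(voted) == voted_parity:
        return voted
    for data, stored in (p1, p2, p3):
        if parity(data) == stored:
            return data
    return None  # would be handed to a partition-fix style mechanism
```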
Using Error-Correcting Codes
Miller et al. [2010] use turbo product code to allow NTV operation of cache while trading off some cache capacity to store ECC information. A product code is an ECC composed of multiple short codes that make up a long code; for example, by arranging data in a 2D matrix, short code words are computed from data in each column and each row. A product code is called a turbo product code if iterative decoding of a long code word is done by arranging short code decoders in a cycle; row and column decoders work separately but iteratively exchange their intermediate results. Thus, the orthogonal data layout allows each bit to receive protection in both its column and row. Their technique leverages the power of iterative decoding to achieve strong correction ability with only small latency. The error correction information is stored in a different way of the same set where data are stored. To account for parametric variation, their technique classifies the cache lines based on their vulnerability, and to increase cache capacity, their technique does not allocate protection for fully functional or unrecoverable lines. Also, the protection is disabled in error-free high-voltage operation and is enabled only in the low-voltage region (refer to Figure 4). Compared to a conventional ECC, their technique incurs much smaller area overhead for storing ECC and also provides larger energy savings.

Alameldeen et al. [2011] note that the probability of multibit errors in a cache line is much smaller than that of zero/single-bit errors. Based on this, they present a technique that uses variable-strength ECC to achieve a balance between protection provided and overhead incurred by ECC. For lines with zero failures or one failure, a simple and fast ECC is used, and for lines with multibit failures, strong ECC is used, which incurs additional latency and area. Only a few lines need such protection, which keeps the overhead of strong ECC small.
To determine which lines will exhibit multibit failures, they use a dynamic cache characterization mechanism that classifies the cache lines during the first transition to low-voltage mode and allocates additional ECC bits for lines that exhibit multibit failures. They show that their technique provides lower ECC overhead than fixed-strength ECC schemes for comparable protection. By combining their technique with a scheme that disables cache lines showing a larger number of errors than can be corrected (refer to Figure 5), they achieve even further energy savings.

show that at low voltage, 64B cache lines typically contain only one hard faulty cell and the probability of finding multiple faulty cells is small. Based on this, they propose using DECTED (double error correction, triple error detection) code, which allows for providing correction for one-bit hard error and one-bit soft error. A cache line containing multiple faulty cells is disabled. This is illustrated in Figure 5. Since cache lines with multiple faulty cells are expected to be small in number, their technique maintains cache capacity even at low supply voltage, while also addressing both hard and soft errors.

present a technique to mitigate the overhead of strong ECC schemes for enabling reliable low-voltage operation. They use a fast mechanism to predict ECC information, and the strong error correction scheme itself is employed in parallel to verify the correctness of the predicted values. The predicted ECC values are fed to subsequent pipeline stages. When the value predicted is the same as the output of strong error correction, the latency of strong error correction is hidden. In case of misprediction, dependent instructions are flushed and restarted using the right output. By virtue of having high accuracy and fast prediction, their technique reduces L1 cache latency, which also translates to energy savings.

evaluate a filter cache [Kin et al. 1997] designed with NT-tolerant SRAM and show that such a design reduces the energy of a conventional filter cache (refer to Figure 7). In high-performance mode, however, such a design leads to performance loss since the filter cache's effectiveness in capturing data locality is limited due to its small size. To avoid this, the filter cache needs to be flushed and then bypassed, which incurs significant overhead. They present a cache architecture for providing large energy savings in low-power mode and minimal runtime overhead in high-performance mode. One cache way is designed using NT-tolerant (e.g., 8T) SRAM, while others are designed using standard SRAM cells. On a cache access, only the NT-tolerant way is accessed; the remaining ways are accessed only if there is a miss in the NT-tolerant way. If there is a hit in the remaining ways, the data are swapped with that in the NT-tolerant way to ensure that MRU (most recently used) data always reside in the NT-tolerant way. Thus, the NT-tolerant way acts as a shield for other ways. If the miss rate in the NT-tolerant way exceeds a threshold, then the cache is dynamically reconfigured such that all ways are accessed in parallel. The advantage of this architecture is that unlike the filter cache, it uses NT-tolerant ways even in high-performance mode and does not require flushing.

Mahmood and Kim [2011] present a fault-buffer-based technique for achieving reliable low-voltage operation in L1 caches. Their technique identifies and disables faulty cache locations at word level (32 bits). These faulty words are instead allocated in a small fully associative fault buffer array (refer to Figure 7). To minimize the overhead of the fault buffer, their technique adapts the size of the fault buffer based on its hit rate, such that when the hit rate is higher than a threshold, the size is reduced and vice versa.
Also, the fault buffer is divided into multiple banks and only one bank is activated on each access to reduce the delay and dynamic energy overhead.
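The fault-buffer organization described above can be modeled in a few lines. This is a behavioral sketch only: the class and function names, the FIFO replacement policy, and the halving/doubling resize step with its bounds are assumptions introduced for illustration, and banking is not modeled. The resize direction follows the description above.

```python
class FaultBufferCache:
    """Word-level model: faulty words are redirected to a small fully
    associative buffer; all other words use the normal data array."""

    def __init__(self, faulty_words, buffer_entries):
        self.faulty = set(faulty_words)  # disabled word addresses
        self.array = {}                  # normal data array
        self.buffer = {}                 # fault buffer (insertion-ordered)
        self.capacity = buffer_entries

    def write(self, addr, value):
        if addr not in self.faulty:
            self.array[addr] = value
            return
        if addr not in self.buffer and len(self.buffer) >= self.capacity:
            # Evict the oldest entry (FIFO, an assumed policy).
            self.buffer.pop(next(iter(self.buffer)))
        self.buffer[addr] = value

    def read(self, addr):
        store = self.buffer if addr in self.faulty else self.array
        return store.get(addr)  # None models a miss


def resize_fault_buffer(size, hit_rate, threshold, min_size=4, max_size=64):
    """Shrink the buffer when its hit rate exceeds the threshold and
    grow it otherwise, as the text describes."""
    if hit_rate > threshold:
        return max(min_size, size // 2)
    return min(max_size, size * 2)
```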
Using Additional Structures
Using Task-Scheduling Scheme
Karpuzcu et al. [2013] note that in a many-core system, use of multiple on-chip voltage domains is energy inefficient. This is because use of multiple-voltage domains requires on-chip voltage regulators, which have low energy efficiency and consume a significant amount of area. Also, NTC exacerbates parametric variations, and hence, fine-grained domains demand higher guardband margins. Hence, they propose using a single-voltage domain and multiple-frequency domains. The cores are organized in clusters to exploit the systematic component of process variation and each cluster can potentially use a single-frequency domain. They also propose a scheduling scheme that assigns jobs to the cores to maximize performance per watt. They show that using their approach, a chip with a single-voltage domain can provide higher performance per watt than one with multiple-voltage domains.
Using NT-Tolerant Circuit Designs
As discussed in Section 2.2.6, traditional circuit designs are susceptible to higher process variation and failure rates at low voltages. To address this, several NT-tolerant circuit designs have been proposed that reduce the failure rates of the circuit. We now discuss a few NT-tolerant circuit designs.
The scaling of MOSFET has led to increased short channel effects, which harm its performance as a switch. To avoid this effect, circuit designs with improved gate control of the channel, such as double-gate MOSFET, have been explored. Double-gate MOSFET has reduced junction capacitance, and the overlap capacitance dominates its drain capacitance [Goel et al. 2009]. An underlap between the source/drain can be used to reduce this overlap capacitance. Of these, underlap on the source side degrades the ON-current and makes the device susceptible to process variation. By comparison, the drain-underlap design reduces the static power consumption and propagation delay and steepens the switching slope of MOSFET [Patil and Qureshi 2011]. Thus, due to these features, the drain-underlap design has been considered promising for near-threshold voltage operation. The limitation of this design, however, is the requirement of extra fabrication steps and reduced ability to work as a pass-gate transistor [Goel et al. 2009; Patil and Qureshi 2011].
An 8T SRAM cell [Chang et al. 2005] adds a two-transistor read stack to the conventional 6T SRAM cell. The word line of the original 6T cell is used only for write operations, and a second read word line is connected to the read stack. This eliminates cell disturbance on a read access. This also allows improving writeability, which improves yield and performance at the NTV operating region [Chang et al. 2005; Kumar and Hinton 2009].

[Figure: cache organization in which, at high voltage, the entire cache is used for storing data, but at low voltage, some cache ways are used for storing additional ECC information for other ways.]
NEAR-THRESHOLD COMPUTING IN VARIOUS PROCESSOR COMPONENTS
Different system components have different properties, and hence, use of NTC in them presents different constraints and optimization opportunities. To underscore this, in Table III, we classify the research works based on the processor component where NTC is used. In this table, we also classify the works based on whether they have been evaluated on a real processor or a simulator to provide insights. We now discuss several of these works.

propose a technique that trades off cache capacity to enable hard/soft-error resilience at lower voltages. At high voltage, only conventional ECC is used and the entire cache is used for storing data. At low voltage, some cache ways are used for storing additional ECC information for other cache ways, at a granularity finer than the cache line, which allows a larger number of errors to be corrected in each line with lower latency and complexity. This is illustrated in Figure 8. For achieving this, their technique divides a cache line into multiple segments and corrects errors on a per-segment basis. The number of ways used for ECC is dynamically adapted on the basis of the target minimum supply voltage, which influences the desired reliability level. Depending on whether performance is relatively independent of cache size in low-voltage mode, their technique can exercise a tradeoff between cache size and reliability.

note that there exists a tradeoff between LLC cell size and its area and reliability, since large cells offer higher reliability but also increase the cache area. They present an approach to jointly optimize the LLC cell size, strength of ECC, and number of redundant cells to minimize the total SRAM area while meeting the minimum-voltage and yield targets. For this, they study the change in SRAM cell failure probability as the size of its transistors is varied.
Then, using ECC and/or redundancy, they apply the necessary amount of fault tolerance to achieve a target minimum V dd (supply voltage) for the given cell failure probability, which is determined by the cell size. They show that compared to using either redundant cells or ECC, use of both schemes allows finer adjustment of the overall cache failure probability, which, in turn, allows for achieving a smaller area for a given target minimum V dd .
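The interplay between cell failure probability, ECC strength, and redundancy can be illustrated with a simple yield model. The sketch below is not the authors' model: it assumes independent cell failures, an ECC that corrects up to `correctable_bits` faulty bits per line (with `bits_per_line` counting data and check bits), and redundancy at line granularity through spare lines. All function and parameter names are hypothetical.

```python
from math import comb


def line_ok_probability(bits_per_line, p_cell_fail, correctable_bits):
    """Probability that a line has at most `correctable_bits` faulty cells."""
    return sum(
        comb(bits_per_line, k)
        * p_cell_fail**k
        * (1 - p_cell_fail) ** (bits_per_line - k)
        for k in range(correctable_bits + 1)
    )


def cache_yield(p_cell_fail, bits_per_line, lines, correctable_bits, spare_lines):
    """Probability that at most `spare_lines` of the physical lines are
    unusable, so that `lines` working lines remain."""
    q = 1.0 - line_ok_probability(bits_per_line, p_cell_fail, correctable_bits)
    total = lines + spare_lines
    return sum(
        comb(total, j) * q**j * (1 - q) ** (total - j)
        for j in range(spare_lines + 1)
    )
```

Under this model, stronger ECC and additional spare lines each raise the yield for a fixed cell failure probability, which is why combining them gives finer control over the overall cache failure probability than either alone.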
NTC in Caches
NTC in Processor Core
Miller et al. [2012a] present a technique to mitigate the effect of process variation and application imbalance in voltage-scaled chips. Their technique provisions two power supply rails for the chip, which are set at different low voltages and can be assigned to a core to change its frequency. Depending on the power constraint, their technique decides how many cores can be assigned to higher voltage at the same time. They present two implementations of their technique. The first implementation aims to mitigate the effect of core-to-core frequency variation to achieve performance homogeneity. For this, their technique schedules slower cores to higher voltages for a longer time and vice versa to achieve nearly the same per-core frequency across all cores over a finite interval. The second implementation aims to reduce workload imbalance that is caused by characteristics of multithreaded applications, such as uneven distribution of work between threads and so forth. For this, their technique uses hints provided by synchronization libraries to determine high-priority threads and the cores running such threads are assigned more time on a high-voltage rail compared to the cores running low-priority threads. Thus, by avoiding idling of threads at synchronization points, their technique improves resource utilization and performance.

present a technique to reduce supply voltage while keeping operating frequency high. They note that for SRAM arrays, write operations lie at the critical path at low supply voltage since their latency grows exponentially with decreasing voltage. To increase operating frequency by overriding SRAM write delay constraints, at low voltage, their technique interrupts write operations before bitcells reach a readable state. Since SRAM structures are rarely read immediately after being written, this approach allows the bitcells to stabilize and reach a readable state after several cycles.
The improvement in the frequency obtained offsets the stall introduced due to avoiding an immediate read after write (IRAW) operation. They also present several strategies to avoid IRAW for different SRAM blocks of an in-order core. For a register file, issuing of those instructions whose source registers have not stabilized can be delayed. For the instruction queue, issuing of those instructions that have recently been allocated can be avoided. For cache, the read operations can be delayed. For branch predictor and return stack buffers, no strategy is used since an IRAW violation only affects performance and not correctness.

Khare and Jain [2013] discuss Intel's NTV research processor, code-named "Claremont." The caches are designed using 10T bitcells, which allows the processor to achieve lower voltage compared to that achieved by using 8T bitcells. Core voltage and performance can be scaled from 1.2V and 915MHz down to 280mV and 3MHz, which reduces total power consumption from 737mW to 2mW. Logic and memory structures function in independent voltage domains, and the minimum voltages achieved in them are 280mV and 550mV, respectively. Minimum energy consumption is achieved at the NTV region (at 0.45V), which provides a 4.7× energy efficiency improvement compared to that at maximum voltage.

discuss a 3D processor named Centip3De, which uses NTC to save energy and offset limited power dissipation capabilities of 3D design. The processor has two stacked dies with 64 ARM M3 near-threshold cores, organized in 16 four-core clusters, each connected to a 1KB I-cache and an 8KB D-cache. Caches are designed using 8T SRAM for reliability. Caches and cores operate in different voltage domains, and each cache operates at 4× the core frequency and communicates with the cores in a round-robin fashion.
The cores running the latency-critical threads can be boosted 2, 4, or 8× in frequency by connecting them to a higher voltage while disabling the remaining cores in the cluster to mitigate the higher power consumption. describe Runnemede, a research architecture that seeks to achieve very high energy efficiency. Runnemede uses several techniques to save energy at all layers of the computing stack, for example, fine-grained power and clock management, near-threshold operation, separate execution units for runtime and application code, and other approaches for reducing energy in memory and the on-chip network. Runnemede ensures resilience toward errors arising due to parametric variation at low voltages. Also, due to its operation at low clock rates, a large number of cores are required to achieve high performance, which is accomplished using hardware-software codesign approaches.
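The Claremont operating points quoted above can be turned into a small worked calculation. The sketch assumes that energy per clock cycle (power divided by frequency) is a reasonable proxy for energy efficiency; the helper name is hypothetical, and the reported 4.7× figure may be based on a different metric.

```python
def energy_per_cycle_nj(power_w, freq_hz):
    """Energy per clock cycle in nanojoules (power / frequency)."""
    return power_w / freq_hz * 1e9


# Operating points quoted in the text.
e_max_voltage = energy_per_cycle_nj(0.737, 915e6)  # 1.2V, 915MHz, 737mW
e_min_voltage = energy_per_cycle_nj(0.002, 3e6)    # 280mV, 3MHz, 2mW

power_reduction = 0.737 / 0.002       # about 368x lower power
frequency_reduction = 915e6 / 3e6     # about 305x lower frequency
efficiency_gain_at_min_voltage = e_max_voltage / e_min_voltage
```

Under this assumption, energy per cycle is roughly 0.81nJ at 1.2V and 0.67nJ at 280mV, which is only about a 1.2× improvement because frequency falls almost as fast as power. This is consistent with the observation that the minimum-energy point lies at 0.45V in the near-threshold region and not at the lowest functional voltage.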
NTC in Research Processors/Prototypes
NEAR-THRESHOLD COMPUTING FOR ACHIEVING VARIOUS OPTIMIZATION OBJECTIVES
Near-threshold computing can be used for optimizing a variety of system metrics. To highlight this, in Table IV we classify the works based on the study and optimization objective of a technique. Note that while a technique improving performance may also provide energy savings, in Table IV , we list a technique in a category for which the technique has been actually evaluated. We now review many of these techniques. present a technique to ensure performance guarantee of superthreshold voltage computing (STC) in the NTC region. Their technique works by computing the clock frequency at NTC for sustaining STC performance. Using this, the lowest possible V dd for sustaining this frequency is computed for each core. Based on these, their power management scheme forms voltage island domains and allocates their NTC voltages. Use of such a multiple-voltage single-frequency scheme helps in mitigating the effect of within-die variations on performance and power and in providing an iso-frequency view of the many-core platform. To further improve the performance, they use a multiple-voltage multiple-frequency scheme that allocates multiple frequencies within a single voltage island depending on the process variation-induced heterogeneity at threshold voltage within the chip. This leads to a heterogeneous NTC many-core, which can provide even larger performance than the STC design. They show that compared to a 16-core STC chip, a 128-core NTC chip using their technique can provide significant energy savings.
Improving Energy Efficiency
BanaiyanMofrad et al. [2011] present a technique that uses a defect map to configure cache architecture to achieve energy savings using voltage scaling. Their technique uses replication of faulty blocks to tolerate faults in them and aims to minimize the number of lines used for replication while tolerating a maximum number of defects. For a faulty subblock (called the host), their technique uses the fault map to find another faulty block in the same or another set, which does not have any faulty subblock in the same position as the host (refer to Figure 9). If such a block is not found, their technique finds another faulty set (called the target) that does not have a faulty block in the same position as the host set and replicates all faulty blocks of the host set to the target set. The correct value is reconstructed from a combination of two blocks. The host and target sets are chosen from different banks, and hence, they can be accessed in parallel, which reduces the access latency.

propose a technique to address within-core variation that arises due to delay variation in functional units at low voltages. Delay variation reduces the core frequency since the frequency of a core is dictated by the critical path delay of the slowest functional unit. Their technique allows the slow units to operate at half the main clock frequency, which moves such units out of the critical path and allows the core frequency to be raised significantly. On the CMP (chip multiprocessor) level, the effect of applying this technique on the slowest core is a significant increase in CMP clock frequency and throughput, which offsets a small loss in performance of individual cores.

Cho and Mahlke [2012] present a technique to recover the performance of multithreaded programs in the NTC paradigm. They statically analyze the target application and instrument dynamic monitoring and priority management code into the program.
Their technique assigns the cores to fast mode at runtime based on the priority set by the instrumented code, such that the core that is more likely to be included in the critical path has more chances of getting accelerated. This helps in minimizing the waiting time on synchronization operations, which improves the performance.
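The effect of running slow functional units at half the main clock frequency, discussed earlier in this subsection, can be shown with a small calculation. The sketch assumes that a half-rate unit has two main-clock cycles to complete, so it constrains the clock period by half of its delay; the unit names and delay values are made up for illustration.

```python
def core_frequency_ghz(unit_delays_ns, half_rate_units=()):
    """Core clock frequency when the period is set by the slowest unit.

    unit_delays_ns maps a functional unit name to its critical path delay.
    Units listed in half_rate_units run at half the main clock frequency,
    so only half of their delay counts against the main clock period.
    """
    period_ns = max(
        (delay / 2 if name in half_rate_units else delay)
        for name, delay in unit_delays_ns.items()
    )
    return 1.0 / period_ns


delays = {"alu": 1.0, "fpu": 1.6, "lsu": 1.1}  # hypothetical delays in ns
baseline = core_frequency_ghz(delays)           # limited by the 1.6ns unit
relaxed = core_frequency_ghz(delays, {"fpu"})   # now limited by the 1.1ns unit
```

In this example the slow unit leaves the critical path and the next-slowest unit determines the clock period, at the cost of lower throughput for operations that use the half-rate unit.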
Reducing Voltage Guardbands
Modern processors use voltage guardbands to avoid errors that can negate the energy savings obtained by using low voltages. To reduce the voltage margins, voltage speculation is used, which works by gradually lowering the supply voltage while keeping the processor frequency constant to save power without harming performance. Voltage speculation, however, requires additional hardware to correct timing errors caused by aggressive speculation.

Bacha and Teodorescu [2013] propose a technique for dynamically reducing voltage margins and lowering V dd by directly using the feedback from ECC-protected blocks. They note that as V dd is lowered, correctable errors in the ECC-protected functional unit happen before uncorrectable errors. Hence, during voltage speculation, the occurrence of correctable errors can be used as an indicator for approaching unsafe operating levels. Based on this, their firmware-based technique uses the type and rate of runtime correctable errors to determine the lowest safe voltage point at which each core can operate. By adapting V dd to each core's safe operating level, their technique accounts for core-to-core variation. Overall, their technique selects the voltage level to keep the processor operating close to but above the safety margin to achieve correct operation while saving a large amount of energy.
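The idea of using correctable errors as an early warning can be sketched as a simple search loop. This is an illustration under stated assumptions and not the authors' firmware: `correctable_errors_at` is a hypothetical callback that stands in for running the core at a given voltage and reading back the number of ECC-corrected errors, and the fixed step size and error budget are also assumptions.

```python
def find_safe_vdd(start_mv, step_mv, min_mv, correctable_errors_at, max_errors):
    """Lower V dd step by step while the correctable-error count stays
    within budget; return the lowest voltage that was accepted."""
    vdd = start_mv
    while vdd - step_mv >= min_mv:
        candidate = vdd - step_mv
        if correctable_errors_at(candidate) > max_errors:
            break  # too many corrected errors: approaching the unsafe region
        vdd = candidate
    return vdd


def example_core(mv):
    """Made-up error profile of one core, used only for illustration."""
    if mv >= 800:
        return 0
    if mv >= 780:
        return 3
    return 50
```

Running the search once per core with that core's own error profile yields a per-core operating point, which is how such a scheme can account for core-to-core variation.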
To avoid the runtime overhead of the firmware-based technique, Bacha and Teodorescu [2014] present a hardware-based technique for guiding voltage speculation in low-voltage processors. They note that correctable errors observed on lowering the V dd occur consistently in the same cache lines, which happens since the cells in these lines are more vulnerable to low voltage than others due to process variation. They also observe that the spread between the V dd at which an error is seen in a sensitive line and the voltage at which the system crashes is 4× larger at low V dd compared to that at nominal V dd . This provides a wide margin of safe operating voltage and allows more aggressive speculation than is possible in the nominal V dd region. Their technique uses a hardware ECC monitor, which continuously tracks the known sensitive cache lines. Based on the feedback from the ECC monitor, a voltage control system adapts V dd in steps of 5mV. They also show that their hardware-based approach provides larger energy savings than the firmware-based approach used in Bacha and Teodorescu [2013] . observe that in the presence of process variation, SRAM bits that fail at some supply voltages also fail at all lower voltages. This property allows the use of a compressible fault map such that multiple supply voltages can be used with little additional overhead compared to a single-voltage fault map. Based on this, their techniques use global voltage scaling of an SRAM data array along with power gating of individual faulty blocks to reduce cache power. Their static technique uses the knowledge of faulty blocks obtained using BIST to choose the optimal cache voltage at boot time to achieve a minimum of 99% fault-free blocks. Since the static technique misses the opportunity to dynamically adapt the cache V dd in response to the workload behavior, they also propose a dynamic technique that adaptively trades off capacity to achieve even further energy savings. 
When the number of misses becomes high at low voltage, their technique raises V dd to increase the count of functional blocks (and hence the capacity), which lowers the miss count and performance loss. Similarly, when the number of misses becomes low, the V dd is reduced to opportunistically save energy.
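The property that a bit failing at one supply voltage also fails at all lower voltages means that a fault map for many voltages can be stored as a single number per block. The sketch below illustrates this and the static 99% selection rule described above. The representation (one minimum working voltage per block), the function names, and the candidate voltage list are assumptions made for illustration.

```python
def is_faulty(block_vmin_mv, vdd_mv):
    """A block is faulty at every supply voltage below its minimum
    working voltage, so one value per block covers all voltages."""
    return vdd_mv < block_vmin_mv


def lowest_vdd_for_yield(block_vmins_mv, candidate_vdds_mv, min_good_fraction=0.99):
    """Pick the lowest candidate voltage at which the required fraction
    of blocks is fault-free; return None if no candidate qualifies."""
    for vdd in sorted(candidate_vdds_mv):
        good = sum(1 for vmin in block_vmins_mv if not is_faulty(vmin, vdd))
        if good / len(block_vmins_mv) >= min_good_fraction:
            return vdd
    return None
```

A dynamic policy of the kind described above could then move between the candidate voltages at runtime, raising V dd when misses are high and lowering it when they are low, with the same map identifying which blocks to power gate at each level.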
Allowing Additional Cores/Parallelism
The energy savings provided by NTC can allow for activating more cores under the same power budget. We now discuss several techniques that use this insight for improving performance. note that in a processor with inclusive cache hierarchy, where faults occur in shared LLC due to voltage scaling, disabling an LLC block that is actively used in first-level private caches may lead to performance loss. To address this, they propose two techniques that utilize coherence schemes to save energy using block disabling while maintaining the performance. The first technique works on the observation that in a directory-based coherence protocol, blocks present in private levels only have to be tracked in tag array, and maintaining their replica in data array of shared LLC is not required. Based on this, the first technique turns on the tags of faulty blocks to ensure directory inclusion. From the perspective of first-level private cache, this restores the associativity of shared LLC, and hence, the block in private cache need not be invalidated. The coherence protocol is adapted so that access to a faulty block is addressed as a cache miss using off-chip access. Since the increase in off-chip misses due to block disabling in this technique can offset the energy savings obtained from low-voltage operation, they also present a second technique. This technique avoids off-chip access when one or more copies of faulty blocks exist in private caches, which happens in the case of parallel workloads. Using directory information, replicated blocks can be tracked, and whenever an L1 request arrives to a faulty LLC block, it is forwarded to another L1, which is a sharer of the requested block. The data obtained is supplied to the requesting L1 using cache-to-cache transfer. They also show that their energy-saving technique allows more cores to be activated within the same power budget, which leads to performance improvement. 
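The request handling of the second technique can be sketched as a three-way decision. This is a simplified model and not the authors' protocol: `sharers` plays the role of the directory, the function name is hypothetical, and choosing the lowest-numbered sharer is an arbitrary assumed policy.

```python
def serve_l1_request(block, requester, faulty_llc_blocks, sharers):
    """Decide where an L1 miss is served from.

    sharers maps a block address to the set of L1 ids holding a copy,
    as recorded by the directory. Returns (source, l1_id_or_None).
    """
    if block not in faulty_llc_blocks:
        return ("llc", None)  # data array entry is usable
    others = sorted(sharers.get(block, set()) - {requester})
    if others:
        # Forward to another sharer: cache-to-cache transfer.
        return ("l1", others[0])
    return ("off_chip", None)  # no on-chip copy exists
```

The first technique corresponds to the case without forwarding, in which every access to a faulty LLC block goes off-chip.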
present a stream processor family that uses massive parallelization and NTV operation of circuits and interconnect. They note that in a stream processor, wide SIMD along with a large number of ALUs (arithmetic logic units) exacerbate the timing variability problem at low voltage. Also, due to the random nature of these variations, the delay in parallel functional units becomes different. They propose two techniques to tolerate such delay variations. Their first technique, which aims to tolerate input-dependent and dynamic variations, allows all functional units to execute the same instruction, but parallel pipelines are allowed to go mutually out of sync so that delay variations can be tolerated independently by each of them. By using extra queues and micro-barriers, timing violations are rebalanced within each pipeline. Their second technique aims to tolerate static variations that cause delay or leakage variations between ALUs. In this technique, shared pipeline components are replicated and the components that do not meet specifications are disabled. This provides fine-grained spatial redundancy, which allows the processor to achieve reliable low-voltage operation. They show that their processor performs more than 1 giga-operation per second (1GOP/s) with less than 1mW total power consumption.

Fig. 10. Depending on their criticality, dirty or clean data, higher- or lower-order bits, and control or data operations are allocated to more reliable or less reliable memory/cores.
Application-Domain-Specific Techniques
Karpuzcu et al. [2014] note that power savings provided by NTC allow more cores to be used in computation within the same power budget; however, the limited parallelism in applications presents an obstacle to it. Toward this, they propose a technique suited for RMS (recognition, mining, and synthesis) applications. These applications comply with weak scaling, whereby the problem size expands naturally as the application scales to utilize more cores. Also, these applications show inherent fault tolerance, and by expanding the problem size, the output quality can be increased. However, increasing the problem size requires lowering V dd , which also increases the vulnerability to variations. These factors introduce a tradeoff between the number of cores used and output quality degradation due to variation-induced errors, and their technique uses problem size as the knob to strike the right balance between these factors. Parametric variation also induces reliability differences between cores, and since RMS applications can tolerate faults in data-intensive program phases but not in control-intensive phases, their technique reserves reliable cores for control phases and uses error-prone cores for data-intensive phases (refer to Figure 10) . present a hybrid memory architecture to trade off output quality to save energy in video applications by allowing more aggressive voltage scaling. They note that the human visual system is primarily sensitive to higher-order bits of luminance pixels in video data. Based on this, their technique stores higher-order luma bits in robust 8T bitcells and lower-order bits in conventional 6T bitcells (refer to Figure 10 ). Under voltage scaling, the important luma bits stored in 8T bitcells remain unaffected, and any fault in less important 6T bitcells has little effect on output quality. They show that under the iso-area condition, their hybrid memory architecture provides larger energy savings compared to a 6T bitcell-only memory.
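The benefit of protecting only the higher-order luma bits can be quantified with a short sketch. It assumes 8-bit luma pixels and treats 8T bitcells as fault-free at the scaled voltage; the function names and the choice of four protected bits in the usage below are illustrative assumptions.

```python
def worst_case_pixel_error(protected_msbs, bits_per_pixel=8):
    """Largest possible change in a pixel value when only the
    unprotected lower-order bits can flip."""
    unprotected = bits_per_pixel - protected_msbs
    return (1 << unprotected) - 1


def corrupt_lsbs(pixel, fault_mask, protected_msbs, bits_per_pixel=8):
    """Apply bit flips from fault_mask, but only to the lower-order bits
    stored in conventional 6T cells; protected bits stay intact."""
    lsb_mask = (1 << (bits_per_pixel - protected_msbs)) - 1
    return pixel ^ (fault_mask & lsb_mask)
```

With four of eight bits protected, even the case where every 6T bit flips changes a pixel by at most 15 out of 255, which illustrates why faults in the lower-order bits have little effect on output quality.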
FUTURE CHALLENGES AND CONCLUSION
Despite the potential of low-voltage computing, performance loss and reliability issues caused by it restrict its use to low-power systems only. We believe that these challenges need to be simultaneously addressed at different levels of the system stack. At the device level, novel cell designs are required that provide a better balance between area, performance, and NT-tolerance. At the microarchitecture level, use of performance optimization schemes such as cache miss rate reduction, novel cache organization, use of high-density memory technologies (e.g., embedded DRAM), and so forth can allow use of NTC for even high-end computing systems. Exploiting inherent resilience of applications to faults can allow aggressive voltage scaling without compromising reliability [Chai et al. 2014] . Similarly, compiler and OS techniques can be leveraged to further enhance application resilience by using approaches such as profiling, instruction scheduling, altering processor component occupancy and memory layout, and so forth [Lee and Shrivastava 2009] .
Given the tall energy efficiency targets of future systems, it is clear that no single technique can bridge the energy efficiency gap between existing and future systems. In the near future, synergistic integration of NTC with other energy-saving approaches, such as data compression, thermal management for temperature reduction, and so forth, will be extremely important and pose a major challenge for researchers.
Most of the existing NTC techniques have been proposed in the context of CPUs. As GPUs contend to become the first-class citizens of a power-limited computing world, their competitiveness vis-à-vis other computing systems such as FPGA and CPU will crucially depend on their energy efficiency [Mittal and Vetter 2015b] . Porting existing NTC techniques to GPUs and even designing novel techniques for GPUs will be vital research problems for designers.
In this article, we synthesized the techniques proposed for using low-voltage computing and specifically near-threshold voltage computing. We classified the techniques on several key features to provide a bird's-eye view of the research field. We concluded this article with a brief mention of some challenges that lie ahead in this field. We believe that this survey will provide valuable insights to researchers into the potential and tradeoffs of NTC and motivate them to further improve the efficacy and adoption of NTC techniques across all computation platforms.
